Abstract

Given $n$ noisy samples with $p$ dimensions, where $n < p$, we show that the multi-step thresholding procedure based on the Lasso, which we call the Thresholded Lasso, can accurately estimate a sparse vector $\beta \in \mathbb{R}^p$ in a linear model $Y = X\beta + \epsilon$, where $X_{n \times p}$ is a design matrix normalized to have column $\ell_2$ norm $\sqrt{n}$, and $\epsilon \sim N(0, \sigma^2 I_n)$. We show that under the restricted eigenvalue (RE) condition (Bickel-Ritov-Tsybakov 09), it is possible to achieve the $\ell_2$ loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle while selecting a sufficiently sparse model, hence achieving sparse oracle inequalities; the oracle would supply perfect information about which coordinates are non-zero and which are above the noise level. In some sense, the Thresholded Lasso recovers the choices that would have been made by the $\ell_0$ penalized least squares estimators, in that it selects a sufficiently sparse model without sacrificing the accuracy in estimating $\beta$ and in predicting $X\beta$. We also show for the Gauss-Dantzig selector (Candes-Tao 07) that, if $X$ obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, one will achieve the sparse oracle inequalities as above, while allowing at most $s_0$ irrelevant variables in the model in the worst case, where $s_0 \le s$ is the smallest integer such that for $\lambda = \sqrt{2\log p/n}$, $\sum_{i=1}^{p} \min(\beta_i^2, \lambda^2\sigma^2) \le s_0\lambda^2\sigma^2$. Our simulation results on the Thresholded Lasso match our theoretical analysis excellently.

Keywords. Linear regression, Lasso, Gauss-Dantzig Selector, $\ell_1$ regularization, $\ell_0$ penalty, multiple-step procedure, ideal model selection, oracle inequalities, restricted orthonormality, statistical estimation, thresholding, linear sparsity, random matrices



1 Introduction 

In a typical high dimensional setting, the number of variables p is much larger than the number of ob- 
servations n. This challenging setting appears in linear regression, signal recovery, covariance selection 

* A preliminary version of this paper with title: Thresholding Procedures for High Dimensional Variable Selection and Statistical 
Estimation, has appeared in Proceedings of Advances in Neural Information Processing Systems 22, (NIPS 2009). This research 
was supported by the Swiss National Science Foundation (SNF) Grant 20PA2 1-120050/1. 
in graphical modeling, and sparse approximations. In this paper, we consider recovering $\beta \in \mathbb{R}^p$ in the following linear model:

$$Y = X\beta + \epsilon, \tag{1.1}$$

where $X$ is an $n \times p$ design matrix, $Y$ is a vector of noisy observations and $\epsilon$ is the noise term. We assume throughout this paper that $p > n$ (i.e. high-dimensional), $\epsilon \sim N(0, \sigma^2 I_n)$, and the columns of $X$ are normalized to have $\ell_2$ norm $\sqrt{n}$. Given such a linear model, two key tasks are to identify the relevant set of variables and to estimate $\beta$ with bounded $\ell_2$ loss. In particular, recovery of the sparsity pattern $S = \mathrm{supp}(\beta) := \{j : \beta_j \ne 0\}$, also known as variable (model) selection, refers to the task of correctly identifying the support set (or a subset of "significant" coefficients in $\beta$) based on the noisy observations.

Even in the noiseless case, recovering $\beta$ (or its support) from $(X, Y)$ seems impossible when $n \ll p$. However, a line of recent research shows that it becomes possible when $\beta$ is sparse, that is, when it has a relatively small number of nonzero coefficients, and when the design matrix $X$ is also sufficiently nice; see Candes et al. (2006); Candes and Tao (2005, 2006); Donoho (2006a). One important stream of research, which we also adopt here, requires computational feasibility for the estimation methods, among which the Lasso and the Dantzig selector are both well studied and shown to have provably nice statistical properties; see for example Bickel et al. (2009); Candes and Tao (2007); Greenshtein and Ritov (2004); Meinshausen and Buhlmann (2006); Meinshausen and Yu (2009); Ravikumar et al. (2008); van de Geer (2008); Wainwright (2009b); Zhao and Yu (2006). For a chosen penalization parameter $\lambda_n > 0$, regularized estimation with the $\ell_1$-norm penalty, also known as the Lasso (Tibshirani, 1996) or Basis Pursuit (Chen et al., 1998), refers to the following convex optimization problem

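$$\hat\beta = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n\|\beta\|_1, \tag{1.2}$$

where the scaling factor $1/(2n)$ is chosen for convenience. The Dantzig selector (Candes and Tao, 2007) is defined as

$$(DS) \quad \arg\min_{\tilde\beta \in \mathbb{R}^p} \|\tilde\beta\|_1 \quad \text{subject to} \quad \left\|\frac{1}{n}X^T(Y - X\tilde\beta)\right\|_\infty \le \lambda_n. \tag{1.3}$$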
Our goal in this work is to recover $S$ as accurately as possible: we wish to obtain $\hat\beta$ such that $|\mathrm{supp}(\hat\beta) \setminus S|$ (and sometimes $|S \triangle \mathrm{supp}(\hat\beta)|$ also) is small, with high probability, while at the same time $\|\hat\beta - \beta\|_2^2$ is bounded within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level (hence achieving the oracle inequality as studied in Candes and Tao (2007); Donoho and Johnstone (1994)). We deem the bound on the $\ell_2$ loss a natural criterion for evaluating a sparse model when it is not exactly $S$. Let $s = |S|$.

Given $T \subseteq \{1, \ldots, p\}$, let us define $X_T$ as the $n \times |T|$ submatrix obtained by extracting columns of $X$ indexed by $T$; similarly, let $\beta_T \in \mathbb{R}^{|T|}$ be a subvector of $\beta \in \mathbb{R}^p$ confined to $T$. Formally, we propose and study a Multi-step Procedure: First we obtain an initial estimator $\beta_{\text{init}}$ using the Lasso as in (1.2) or the Dantzig selector as in (1.3), with $\lambda_n = d\sigma\sqrt{2\log p/n}$, for some constant $d > 0$.
1. We then threshold the estimator $\beta_{\text{init}}$ with $t_0$, with the general goal that we get a set $I_1$ with cardinality at most $2s$; in general, we also have $|I_1 \cup S| \le 2s$, where $I_1 = \{j \in \{1, \ldots, p\} : \beta_{j,\text{init}} \ge t_0\}$ for some $t_0$ to be specified. Set $I = I_1$.

2. We then feed $(Y, X_I)$ to either the Lasso estimator as in (1.2) or the ordinary least squares (OLS) estimator to obtain $\hat\beta$, where we set $\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$ and $\hat\beta_{I^c} = 0$.

3. Possibly threshold $\hat\beta_I$ with $t_1 = 4\lambda_n\sqrt{|I_1|}$ to obtain $I_2$, and repeat step 2 with $I = I_2$ to obtain $\hat\beta_I$; set other coordinates to zero and return $\hat\beta$.
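The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the experiments in this paper: the Lasso in (1.2) is solved by a plain coordinate-descent loop, the thresholds are applied to absolute values, and the function names (`lasso_cd`, `ols_refit`, `thresholded_lasso`) and the number of sweeps are choices made here for illustration only.

```python
import numpy as np

def lasso_cd(X, Y, lam, n_sweeps=200):
    """Coordinate descent for (1/(2n))||Y - X b||_2^2 + lam * ||b||_1.
    Assumes no column of X is identically zero."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = Y.astype(float).copy()          # residual Y - X beta
    col_sq = (X ** 2).sum(axis=0) / n       # equals 1 for normalized columns
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of column j with the partial residual
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # soft-thresholding update for coordinate j
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid += X[:, j] * (beta[j] - new)
                beta[j] = new
    return beta

def ols_refit(X, Y, support, p):
    """OLS on the selected columns, zero elsewhere (step 2)."""
    beta_hat = np.zeros(p)
    if len(support) > 0:
        beta_hat[support] = np.linalg.lstsq(X[:, support], Y, rcond=None)[0]
    return beta_hat

def thresholded_lasso(X, Y, lam_n, t0, third_step=False):
    """Initial Lasso, threshold at t0, OLS refit; optional second threshold."""
    n, p = X.shape
    beta_init = lasso_cd(X, Y, lam_n)
    I = np.flatnonzero(np.abs(beta_init) >= t0)      # step 1
    beta_hat = ols_refit(X, Y, I, p)                 # step 2
    if third_step and I.size > 0:                    # step 3 (optional)
        t1 = 4.0 * lam_n * np.sqrt(I.size)
        I = I[np.abs(beta_hat[I]) >= t1]
        beta_hat = ols_refit(X, Y, I, p)
    return beta_hat, I
```

With columns scaled to $\ell_2$ norm $\sqrt{n}$, one would call `thresholded_lasso(X, Y, lam_n, t0)` with `lam_n = d * sigma * np.sqrt(2 * np.log(p) / n)`; the constant `d` and the threshold `t0` are left to the caller here.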

Our algorithm is constructive in that it relies neither on the unknown parameters $s$ and $\beta_{\min} := \min_{j \in S}|\beta_j|$, nor on exact knowledge of the parameters that characterize the incoherence conditions on $X$; instead, our choice of $\lambda_n$ and thresholding parameters depends only on $\sigma$, $n$, and $p$, and some crude estimation of certain parameters, which we will explain in later sections. In our experiments, we apply only the first two steps with the Lasso as an initial estimator, which we refer to as the Thresholded Lasso estimator; the Gauss-Dantzig selector is a two-step procedure with the Dantzig selector as $\beta_{\text{init}}$ (Candes and Tao, 2007). We apply the third step only when $\beta_{\min}$ is sufficiently large, so as to get a very sparse model $I \supseteq S$ (cf. Theorem 1.1). We now formally define some incoherence conditions in Section 1.1 and elaborate on our goals in Section 1.2, where we also outline the rest of this section.



1.1 Incoherence conditions 

For a matrix $A$, let $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ denote the smallest and the largest eigenvalues respectively. We refer to a vector $\upsilon \in \mathbb{R}^p$ with at most $s$ non-zero entries, where $s \le p$, as an $s$-sparse vector. Occasionally, we use $\beta_T \in \mathbb{R}^{|T|}$, where $T \subseteq \{1, \ldots, p\}$, to also represent its 0-extended version $\beta' \in \mathbb{R}^p$ such that $\beta'_{T^c} = 0$ and $\beta'_T = \beta_T$; for example in (1.10) below. We assume

$$\Lambda_{\min}(2s) := \min_{\upsilon \ne 0;\ 2s\text{-sparse}} \frac{\|X\upsilon\|_2^2}{n\|\upsilon\|_2^2} > 0, \tag{1.4}$$

where $n \ge 2s$ is necessary, as any submatrix with more than $n$ columns must be singular. In general, we also assume that

$$\Lambda_{\max}(2s) := \max_{\upsilon \ne 0;\ 2s\text{-sparse}} \frac{\|X\upsilon\|_2^2}{n\|\upsilon\|_2^2} < \infty. \tag{1.5}$$

Candes and Tao (2005) define the $s$-restricted isometry constant $\delta_s$ of $X$ to be the smallest quantity such that for all $T \subseteq \{1, \ldots, p\}$ with $|T| \le s$ and coefficient sequences $(\upsilon_j)_{j \in T}$, it holds that

$$(1 - \delta_s)\|\upsilon\|_2^2 \le \|X_T\upsilon\|_2^2/n \le (1 + \delta_s)\|\upsilon\|_2^2. \tag{1.6}$$

The $(s, s')$-restricted orthogonality constant $\theta_{s,s'}$ is the smallest quantity such that for all disjoint sets $T, T' \subseteq \{1, \ldots, p\}$ of cardinality $|T| \le s$ and $|T'| \le s'$,

$$\frac{|\langle X_T c, X_{T'} c' \rangle|}{n} \le \theta_{s,s'}\|c\|_2\|c'\|_2 \tag{1.7}$$
holds, where $s + s' \le p$. Note that $\theta_{s,s'}$ and $\delta_s$ are non-decreasing in $s, s'$, and small values of $\theta_{s,s'}$ indicate that disjoint subsets of covariates in $X_T$ and $X_{T'}$ span nearly orthogonal subspaces. (See Lemma 5.4 for a general bound on $\theta_{s,s'}$.) For $\delta_s$, it holds that $1 - \delta_s \le \Lambda_{\min}(s) \le \Lambda_{\max}(s) \le 1 + \delta_s$. Hence $\delta_{2s} < 1$ implies that condition (1.4) holds. As a consequence of these definitions, for any subset $I$, we have

$$\Lambda_{\max}(|I|) \ge \Lambda_{\max}(X_I^T X_I/n) \ge \Lambda_{\min}(X_I^T X_I/n) \ge \Lambda_{\min}(|I|), \tag{1.8}$$

where $\Lambda_{\min}(|I|) \ge \Lambda_{\min}(2s) > 0$ and $\Lambda_{\max}(|I|) \le \Lambda_{\max}(2s)$ for $|I| \le 2s$. We next introduce some conditions on the design, namely, the Restricted Eigenvalue (RE) condition by Bickel et al. (2009) and the Uniform Uncertainty Principle by Candes and Tao (2007), which we use throughout this paper.
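For very small designs, the sparse eigenvalues in (1.4) and (1.5) and the restricted isometry constant in (1.6) can be computed by exhaustive search over supports. The sketch below does exactly that; it is exponential in $p$ and is meant only to make the definitions concrete. The function names are illustrative, and the formula for $\delta_k$ relies on the observation that the smallest constant satisfying (1.6) is $\max(1 - \Lambda_{\min}(k), \Lambda_{\max}(k) - 1)$.

```python
import itertools
import numpy as np

def sparse_eigenvalues(X, k):
    """Return (Lambda_min(k), Lambda_max(k)) by enumerating all size-k supports.

    Supports of size exactly k suffice: by eigenvalue interlacing, enlarging
    a support can only lower the smallest and raise the largest eigenvalue.
    """
    n, p = X.shape
    lam_min, lam_max = np.inf, -np.inf
    for T in itertools.combinations(range(p), k):
        cols = X[:, list(T)]
        eig = np.linalg.eigvalsh(cols.T @ cols / n)   # ascending order
        lam_min = min(lam_min, eig[0])
        lam_max = max(lam_max, eig[-1])
    return lam_min, lam_max

def restricted_isometry_constant(X, k):
    """Smallest delta such that (1.6) holds for all supports of size <= k."""
    lam_min, lam_max = sparse_eigenvalues(X, k)
    return max(1.0 - lam_min, lam_max - 1.0)
```

For a design with orthogonal columns of norm $\sqrt{n}$, both sparse eigenvalues equal 1 and the isometry constant is 0; for random designs the gap between the two eigenvalues widens as $k$ grows.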
Assumption 1.1. (Restricted Eigenvalue Condition $RE(s, k_0, X)$ (Bickel et al., 2009)) For some integer $1 \le s \le p$ and a number $k_0 > 0$, it holds for all $\upsilon \ne 0$,

$$\frac{1}{K(s, k_0)} := \min_{\substack{J_0 \subseteq \{1, \ldots, p\}, \\ |J_0| \le s}} \ \min_{\|\upsilon_{J_0^c}\|_1 \le k_0\|\upsilon_{J_0}\|_1} \frac{\|X\upsilon\|_2}{\sqrt{n}\,\|\upsilon_{J_0}\|_2} > 0. \tag{1.9}$$

Assumption 1.2. (A Uniform Uncertainty Principle) (Candes and Tao, 2007) For some integer $1 \le s < n/3$, assume $\delta_{2s} + \theta_{s,2s} < 1$, which implies that $\Lambda_{\min}(2s) > \theta_{s,2s}$ given that $1 - \delta_{2s} \le \Lambda_{\min}(2s)$.

If $RE(s, k_0, X)$ is satisfied with $k_0 \ge 1$, then (1.4) must hold. Bounds on prediction loss and $\ell_p$ loss, where $1 \le p \le 2$, for estimating the parameters are derived for both the Lasso and the Dantzig selector in both linear and nonparametric regression models; see Bickel et al. (2009). We now define oracle inequalities in terms of $\ell_2$ loss as explored in Candes and Tao (2007), where they show that such inequalities hold for the Dantzig selector under the UUP (cf. Proposition 4.1).



1.2 Oracle inequalities 

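Consider the least squares estimators $\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$, where $|I| \le s$. Consider the ideal least-squares estimator $\beta^{\diamond}$

$$\beta^{\diamond} = \arg\min_{I \subseteq \{1, \ldots, p\},\ |I| \le s} E\|\beta - \hat\beta_I\|_2^2, \tag{1.10}$$

which minimizes the expected mean squared error. It follows from Candes and Tao (2007) that for $\Lambda_{\max}(s) < \infty$,

$$E\|\beta - \beta^{\diamond}\|_2^2 \ge \min(1, 1/\Lambda_{\max}(s)) \sum_{i=1}^{p} \min(\beta_i^2, \sigma^2/n). \tag{1.11}$$

Now we check if for $\Lambda_{\max}(s) < \infty$, it holds with high probability that

$$\|\hat\beta - \beta\|_2^2 = O(\log p) \sum_{i=1}^{p} \min(\beta_i^2, \sigma^2/n), \quad \text{so that} \tag{1.12}$$

$$\|\hat\beta - \beta\|_2^2 = O(\log p)\max(1, \Lambda_{\max}(s))\, E\|\beta^{\diamond} - \beta\|_2^2 \tag{1.13}$$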
holds in view of (1.11). These bounds are meaningful since

$$\sum_{i=1}^{p} \min(\beta_i^2, \sigma^2/n) = \min_{I \subseteq \{1, \ldots, p\}} \|\beta - \beta_I\|_2^2 + \frac{|I|\sigma^2}{n}$$

represents the squared bias and variance. Define $s_0$ as the smallest integer such that

$$\sum_{i=1}^{p} \min(\beta_i^2, \lambda^2\sigma^2) \le s_0\lambda^2\sigma^2, \quad \text{where } \lambda = \sqrt{2\log p/n}. \tag{1.14}$$

A consequence of this definition is: $|\beta_j| < \lambda\sigma$ for all $j > s_0$, if we order $|\beta_1| \ge |\beta_2| \ge \cdots \ge |\beta_p|$ (cf. (4.7)).
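The quantity $s_0$ in (1.14) is easy to compute when $\beta$ is known, which is convenient in simulations. The helper below is a small sketch using only the standard library; the name `effective_sparsity` and the floating-point tolerance `tol` are choices made here, not part of the paper. It returns the smallest integer $s_0$ satisfying (1.14) and assumes $p \ge 2$.

```python
import math

def effective_sparsity(beta, sigma, n, tol=1e-9):
    """Smallest integer s0 with
        sum_i min(beta_i^2, lam^2 sigma^2) <= s0 * lam^2 sigma^2,
    where lam = sqrt(2 log p / n) as in (1.14). Requires p >= 2."""
    p = len(beta)
    level = 2.0 * math.log(p) / n * sigma ** 2      # lam^2 * sigma^2
    total = sum(min(b * b, level) for b in beta)
    # tol guards against round-off when total is an exact multiple of level
    return math.ceil(total / level - tol)
```

Coefficients at or above $\lambda\sigma$ in magnitude each contribute a full unit to the count, while smaller coefficients contribute fractionally, so $s_0$ never exceeds the number of nonzero entries.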
We define a quantity $\lambda_{\sigma,a,p}$ for each $a > 0$, by which we bound the maximum correlation between the noise and covariates of $X$, which we only apply to $X$ with column $\ell_2$ norm bounded by $\sqrt{n}$. For each $a > 0$, let

$$\mathcal{T}_a := \left\{\epsilon : \left\|X^T\epsilon/n\right\|_\infty \le \lambda_{\sigma,a,p}\right\}, \quad \text{where } \lambda_{\sigma,a,p} = \sigma\sqrt{1 + a}\sqrt{2\log p/n}; \tag{1.15}$$

we have (see Candes and Tao (2007)) $P(\mathcal{T}_a) \ge 1 - (\sqrt{\pi\log p}\,p^a)^{-1}$.

The main theme of our paper is to explore oracle inequalities of the thresholding procedures under conditions as described above. For the Lasso estimator and the Dantzig selector, under the sparsity constraint, such oracle results have been obtained in a line of recent work for either the prediction error or the $\ell_p$ loss, where $1 \le p \le 2$; see for example Bickel et al. (2009); Bunea et al. (2007a,b,c); Cai et al. (2009); Candes and Plan (2009); Candes and Tao (2007); Koltchinskii (2009a,b); van de Geer and Buhlmann (2009); van de Geer et al. (2010); van de Geer (2008); Zhang and Huang (2008); Zhang (2009) under conditions stated above, or other variants.

Along this line, we prove new results for both the Lasso as an initial estimator and for the thresholded estimators. In Sections 1.3 and 1.4, we show oracle results for the Thresholded Lasso and the Gauss-Dantzig selector in terms of achieving the sparse oracle inequalities which we shall formally define in Section 1.4. While the focus of the present paper is on variable selection and oracle inequalities in terms of $\ell_2$ loss, prediction errors are also explicitly derived in Section 1.5; there we introduce the oracle inequalities in terms of prediction error and show a natural interpretation for the Thresholded Lasso estimator when relating to the $\ell_0$ penalized least squares estimators, in particular, ones that have been studied by Foster and George (1994); see also Barron et al. (1999); Birge and Massart (1997, 2001) for subsequent developments. In Section 1.6, we discuss recovery of a subset of strong signals.

1.3 Variable selection under the RE condition

Our first result in Theorem 1.1 shows that consistent variable selection is possible under the RE condition. We do not impose any extra constraint on $s$ besides what is allowed in order for (1.9) to hold. Note that when $s > n/2$, it is impossible for the restricted eigenvalue assumption to hold, as $X_I$ for any $I$ such that $|I| = 2s$ becomes singular in this case. Hence our algorithm is especially relevant if one would like to estimate a parameter $\beta$ such that $s$ is very close to $n$; see Section 2 for such examples. Our analysis builds upon the rate of convergence bounds for $\beta_{\text{init}}$ derived in Bickel et al. (2009). The first implication of this work, and also one of the motivations for analyzing the thresholding methods, is: under Assumption 1.1, one can obtain consistent variable selection for very significant values of $s$, if only a few extra variables are allowed to be included in the estimator $\hat\beta$. Note that we did not optimize the lower bound on $s$ as we focus on cases when the support $S$ is large.

Theorem 1.1. Suppose that $RE(s, k_0, X)$ holds with $K(s, k_0)$, where $k_0 = 1$ for the Dantzig selector and $= 3$ for the Lasso. Suppose $\lambda_n \ge f\lambda_{\sigma,a,p}$ for $\lambda_{\sigma,a,p}$ as in (1.15), where $f = 1$ for the Dantzig selector, and $= 2$ for the Lasso. Let $s \ge K^4(s, k_0)$. Suppose $\beta_{\min} := \min_{j \in S}|\beta_j| \ge B_4\lambda_n\sqrt{s}$, where $B_4 = 4\sqrt{2}\max(K(s, k_0), 1) + \max\left(4K^2(s, k_0), \sqrt{2}/(f\Lambda_{\min}(2s))\right)$. Then on $\mathcal{T}_a$, the multi-step procedure returns $\hat\beta$ such that for $B_3 = (1 + a)(1 + 1/(16f^2\Lambda_{\min}^2(2s)))$,

$$S \subseteq I := \mathrm{supp}(\hat\beta), \quad \text{where } |I \setminus S| < 1/(16f^2\Lambda_{\min}^2(2s)) \text{ and}$$
$$\|\hat\beta - \beta\|_2^2 \le \lambda_{\sigma,a,p}^2|I|/\Lambda_{\min}^2(|I|) \le B_3(2\log p/n)s\sigma^2/\Lambda_{\min}^2(|I|).$$

In Section 7, our simulation results using the Thresholded Lasso show that the exact recovery rate of the support is very high for a few types of random matrices once the number of samples passes a certain threshold. We note that the oracle inequality as in (1.12) is also achieved given that $\beta_{\min} \ge \sigma/\sqrt{n}$; hence $\sum_{i=1}^{p} \min(\beta_i^2, \sigma^2/n) = s\sigma^2/n$. We next extend model selection consistency beyond the notion of exact recovery of the support set $S$ as we introduced earlier, which has been considered in Meinshausen and Buhlmann (2006); Wainwright (2009b); Zhao and Yu (2006). Instead of having to make strong assumptions on either the signal strength, for example, on $\beta_{\min}$, or the incoherence conditions (or both), we focus on defining a meaningful criterion for model selection consistency when both are relatively weak.

1.4 Thresholding that achieves sparse oracle inequalities 

The natural question upon obtaining Theorem 1.1 is: is there a good thresholding rule that enables us to obtain a sufficiently sparse estimator $\hat\beta$ which satisfies the oracle inequality as in (1.12), when some components of $\beta_S$ (and hence $\beta_{\min}$) are well below $\sigma/\sqrt{n}$? Theorem 1.2 answers this question positively: under a uniform uncertainty principle (UUP), thresholding of an initial Dantzig selector $\beta_{\text{init}}$ at the level of $C_1\sqrt{2\log p/n}\,\sigma$ for some constant $C_1$ identifies a sparse model $I$ of cardinality at most $2s_0$ such that its corresponding least-squares estimator $\hat\beta$ based on the model $I$ achieves the oracle inequality as in (1.12). This is accomplished without any knowledge of the significant coordinates or parameter values of $\beta$. Theorem 1.3 shows that exactly the same type of sparse oracle inequalities hold for the Thresholded Lasso under the RE condition, which is surprising but also mostly anticipated; this is also the key contribution of this paper. For simplicity, we always aim to bound $|I| \le 2s_0$ while achieving the oracle inequality as in (1.12); one could aim to bound $|I| \le cs_0$ for some other constant $c > 0$. We refer to estimators that satisfy both constraints as estimators that achieve the sparse oracle inequalities. Moreover, we note that thresholding of an initial estimator $\beta_{\text{init}}$ which achieves $\ell_2$ loss as in (1.12), at the level of $c_1\sigma\sqrt{2\log p/n}$ for some constant $c_1 > 0$, will always select nearly the best subset of variables in the spirit of Theorems 1.2 and 1.3; formal statements of such results are omitted.
Theorem 1.2. (Variable selection under UUP) Choose $\tau, a > 0$ and set $\lambda_n = \lambda_{p,\tau}\sigma$, where $\lambda_{p,\tau} := (\sqrt{1 + a} + \tau^{-1})\sqrt{2\log p/n}$, in (1.3). Suppose $\beta$ is $s$-sparse with $\delta_{2s} + \theta_{s,2s} < 1 - \tau$. Let the threshold $t_0$ be chosen from the range $(C_1\lambda_{p,\tau}\sigma, C_4\lambda_{p,\tau}\sigma]$ for some constants $C_1, C_4$ to be defined. Then with probability at least $1 - (\sqrt{\pi\log p}\,p^a)^{-1}$, the Gauss-Dantzig selector $\hat\beta$ selects a model $I := \mathrm{supp}(\hat\beta)$ such that we have

$$|I| \le 2s_0 \quad \text{and} \quad |I \setminus S| \le s_0 \le s \quad \text{and} \tag{1.16}$$

$$\|\hat\beta - \beta\|_2^2 \le 2C_3^2\log p\left(\sigma^2/n + \sum_{i=1}^{p} \min(\beta_i^2, \sigma^2/n)\right), \tag{1.17}$$

where $C_1$ is defined in (4.2) and $C_3$ depends on $a$, $\tau$, $\delta_{2s}$, $\theta_{s,2s}$ and $C_4$; see (4.3).

Theorem 1.3. (Ideal model selection for the Thresholded Lasso) Suppose $RE(s_0, 6, X)$ holds with $K(s_0, 6)$, and conditions (1.4) and (1.5) hold. Let $\beta_{\text{init}}$ be an optimal solution to (1.2) with $\lambda_n = d_0\sqrt{2\log p/n}\,\sigma \ge 2\lambda_{\sigma,a,p}$, where $a > 0$ and $d_0 \ge 2\sqrt{1 + a}$. Suppose that we choose $t_0 = C_4\lambda\sigma$, for some constant $C_4 \ge D_1$, where $D_1 = \Lambda_{\max}(s - s_0) + 9K^2(s_0, 6)/2$; set $I = \{j \in \{1, \ldots, p\} : \beta_{j,\text{init}} \ge t_0\}$. Then for $\mathcal{D} := \{1, \ldots, p\} \setminus I$ and $\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$, we have on $\mathcal{T}_a$:

$$|I| \le s_0(1 + D_1/C_4) \le 2s_0, \quad |I \cup S| \le s + s_0 \quad \text{and}$$

$$\|\hat\beta - \beta\|_2^2 \le 2D_3^2\log p\left(\sigma^2/n + \sum_{i=1}^{p} \min(\beta_i^2, \sigma^2/n)\right),$$

where $D_3$ depends on $a$, $K(s_0, 6)$, $D_0$ and $D_1$ in (5.2) and (5.3), $\Lambda_{\min}(|I|)$, a restricted orthogonality constant $\theta$, and $C_4$; see (5.4).

Our analysis for Theorem 1.2 builds upon Candes and Tao (2007), who show that so long as $\beta$ is sufficiently sparse, the Dantzig selector as in (1.3) achieves the oracle inequality as in (1.12). Note that allowing $t_0$ to be chosen from a range (as wide as one would like, at the cost of increasing the constant $C_3$ in (1.17)) saves us from having to estimate $C_1$, which indeed depends on $\delta_{2s}$ and $\theta_{s,2s}$. The same comment applies to Theorem 1.3 for $D_3$. Assumption 1.2 implies that Assumption 1.1 holds for $k_0 = 1$ with $K(s, k_0) = \sqrt{\Lambda_{\min}(2s)}/(\Lambda_{\min}(2s) - \theta_{s,2s}) \le \sqrt{\Lambda_{\min}(2s)}/(1 - \delta_{2s} - \theta_{s,2s})$ (see Bickel et al. (2009)). For a more comprehensive comparison between these conditions, we refer to van de Geer and Buhlmann (2009). We note that $RE(s_0, 6, X)$ is imposed on $X$ with sparsity fixed at $s_0$ (rather than $s$) and $k_0 = 6$ in Theorem 5.1. Important consequences of this result are shown in Section 1.5. The term sparsity oracle inequalities has also been used in the literature, where it is targeted at bounding prediction errors of the estimators with the best sparse approximation of the regression function known by an oracle; see Bickel et al. (2009) and more references therein. It would be interesting to explore such properties for the Thresholded Lasso under the RE conditions.

1.5 Connecting to the $\ell_0$ penalized least squares estimators

Now why is the bound of $|I| \le 2s_0$ interesting? We wish to point out that this would make the behavior of the Thresholded Lasso procedure somehow mimic that of the $\ell_0$ penalized estimators, which are computationally inefficient, as we introduce next. Let $P_I$ denote the orthogonal projection onto the column space of $X_I$. It is clear that for the least squares estimator based on $I$, $\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$, it holds that

$$\|X\hat\beta_I - X\beta\|_2^2 = \|P_I(X\beta + \epsilon) - X\beta\|_2^2 = \|(P_I - \mathrm{Id})X_{I^c}\beta_{I^c} + P_I\epsilon\|_2^2 \tag{1.18}$$

and hence

$$E\|X\hat\beta_I - X\beta\|_2^2/n = \|(P_I - \mathrm{Id})X_{I^c}\beta_{I^c}\|_2^2/n + |I|\sigma^2/n, \tag{1.19}$$

which again shows the typical bias and variance tradeoff. Consider the best model $I_0$ upon which $\hat\beta_{I_0} = (X_{I_0}^T X_{I_0})^{-1} X_{I_0}^T Y$ achieves the minimum in (1.19):

$$I_0 = \arg\min_{I \subseteq \{1, \ldots, p\}} \|(P_I - \mathrm{Id})X_{I^c}\beta_{I^c}\|_2^2/n + |I|\sigma^2/n.$$
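The decomposition in (1.18) is an exact algebraic identity, so it can be checked numerically. The sketch below builds the projection $P_I$ explicitly for a small design and returns the left-hand side together with its bias and noise parts; it also evaluates the right-hand side of (1.19). The helper names and the use of a pseudo-inverse to form $P_I$ are illustrative choices made here.

```python
import numpy as np

def projection(X_I):
    """Orthogonal projection onto the column space of X_I."""
    return X_I @ np.linalg.pinv(X_I)

def prediction_error_terms(X, beta, eps, I):
    """Return (X beta_hat_I - X beta, bias part, noise part) as in (1.18)."""
    n, p = X.shape
    Ic = np.setdiff1d(np.arange(p), I)
    P = projection(X[:, I])
    Y = X @ beta + eps
    coef = np.linalg.lstsq(X[:, I], Y, rcond=None)[0]   # OLS on model I
    lhs = X[:, I] @ coef - X @ beta
    bias = (P - np.eye(n)) @ (X[:, Ic] @ beta[Ic])
    noise = P @ eps
    return lhs, bias, noise

def expected_risk(X, beta, sigma, I):
    """Right-hand side of (1.19): squared bias plus variance, each over n."""
    n, p = X.shape
    Ic = np.setdiff1d(np.arange(p), I)
    P = projection(X[:, I])
    bias = (P - np.eye(n)) @ (X[:, Ic] @ beta[Ic])
    return bias @ bias / n + len(I) * sigma ** 2 / n
```

The bias part lies in the orthogonal complement of the column space of $X_I$ while the noise part lies inside it, which is why the cross term vanishes and (1.19) splits cleanly into two terms.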

Now the question is: can one do nearly as well as $\hat\beta_{I_0}$ in the sense of achieving mean square error within a $\log p$ factor of $E\|X\hat\beta_{I_0} - X\beta\|_2^2$? It turns out that the answer is yes, if one solves the following $\ell_0$ penalized least squares problem with $\lambda_0 = \sqrt{\log p/n}$, as proposed in the RIC procedure (Foster and George, 1994):

$$\hat\beta = \arg\min_{\beta} \|Y - X\beta\|_2^2/(2n) + \lambda_0^2\sigma^2\|\beta\|_0, \tag{1.20}$$

where $\|\beta\|_0$ is the number of nonzero components in $\beta$. This is shown in a series of papers: Barron et al. (1999); Birge and Massart (1997, 2001); Foster and George (1994). We refer to Barron et al. (1999); Foster and George (1994) for other procedures related to (1.20). Note that $\|Y - X\hat\beta\|_2^2 \le 2\|X\beta - X\hat\beta\|_2^2 + 2\|\epsilon\|_2^2$; hence we only need to look at the tradeoff between $\|X\beta - X\hat\beta\|_2^2$ and $\log p\,|I|$. Note that $\|X\hat\beta - X\beta\|_2^2$ would be 0 if $\hat\beta = \beta$, but $|I|$ would be large. Theorem 1.4 shows that (a) the thresholded estimators achieve a balance between the "complexity" measure $\log p\,|I|$ and $\|X\hat\beta - X\beta\|_2^2$, which now have the same order of magnitude; and (b) in some sense, variables in model $I$ are essential in predicting $X\beta$.
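To make the computational cost of (1.20) concrete, the following sketch solves it by exhaustive search for a tiny problem. For a fixed support the inner minimization is an OLS fit, so the search reduces to comparing the residual sum of squares plus the penalty over supports. The cap `max_size` on the support size and the function names are assumptions introduced here to keep the enumeration small; this is not a procedure taken from the cited references.

```python
import itertools
import numpy as np

def l0_objective(X, Y, support, lam0, sigma):
    """Value of (1.20) at the OLS fit restricted to `support`."""
    n = X.shape[0]
    if len(support) == 0:
        resid = Y
    else:
        cols = X[:, list(support)]
        resid = Y - cols @ np.linalg.lstsq(cols, Y, rcond=None)[0]
    return resid @ resid / (2 * n) + lam0 ** 2 * sigma ** 2 * len(support)

def best_subset(X, Y, sigma, max_size):
    """Exhaustive search over supports of size at most max_size."""
    n, p = X.shape
    lam0 = np.sqrt(np.log(p) / n)            # lambda_0 as in the text
    best, best_val = (), l0_objective(X, Y, (), lam0, sigma)
    for k in range(1, max_size + 1):
        for T in itertools.combinations(range(p), k):
            val = l0_objective(X, Y, T, lam0, sigma)
            if val < best_val:
                best, best_val = T, val
    return best, best_val
```

The number of candidate supports grows combinatorially in $p$, which is the inefficiency that thresholding an initial Lasso solution is meant to sidestep.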

Theorem 1.4. Let $I$ be the model selected by thresholding an initial estimator $\beta_{\text{init}}$, under conditions as described in Theorem 1.2 or Theorem 1.3. Let $\mathcal{D} := \{1, \ldots, p\} \setminus I$. Let $s_0$ be as defined in (1.14) and $\lambda = \sqrt{2\log p/n}$. For $\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$ and some constant $C$, we have on $\mathcal{T}_a$,

$$\|X\hat\beta_I - X\beta\|_2^2 \ \ldots$$



Comparing (1.20) and (1.2), it is clear that for entries $\beta_{j,\text{init}} < \lambda_0\sigma$ in a Lasso estimator, their contributions to the optimization function in (1.20) will be larger than that in (1.2) if $\lambda_n = \lambda_0\sigma$; hence removing these entries from the initial estimator in some sense recovers the choices that would have been made by the complexity-based function as in (1.20). Put another way, getting rid of variables $\{j : \beta_{j,\text{init}} < \lambda_0\sigma\}$ from the solution to (1.2) with $\lambda_n \asymp \lambda_0\sigma$ is in some way restoring the behavior of (1.20) in a brute-force manner. Proposition 1.5 (by setting $c' = 1$) shows that the number of variables in $\beta$ at, above, or around $\sqrt{\log p/n}\,\sigma$ in magnitude is bounded by $2s_0$. (One could choose another target set: for example, $\{j : |\beta_j| \ge \sqrt{\log p/(c'n)}\,\sigma\}$, for some $c' > 1/2$.) Roughly speaking, we wish to include most of them by leaving $2s_0$ variables in the model $I$. Such connections will be made precise in our future work.

1.6 Controlling Type II errors 

In Section 6 (cf. Theorem 6.3), we show that we can recover a subset $S_L$ of variables accurately, where $S_L := \{j : |\beta_j| \ge \sqrt{2\log p/n}\,\sigma\}$, under Assumption 1.1 when $\beta_{\min,S_L} := \min_{j \in S_L}|\beta_j|$ is large enough (relative to the $\ell_2$ loss of an initial estimator under the RE condition on the set $S_L$); in addition, a small number of extra variables from $\{1, \ldots, p\} \setminus T_0 =: T_0^c$ are possibly also included in the model $I$, where $T_0$ denotes positions of the $s_0$ largest coefficients of $\beta$ in absolute values. In this case, it is also possible to get rid of variables from $T_0^c$ entirely by increasing the threshold $t_0$ while making the lower bound on $\beta_{\min,S_L}$ a constant times stronger. We omit such details from the paper. Hence compared to Theorem 1.1, we have relaxed the restriction on $\beta_{\min}$: rather than requiring all non-zero entries to be large, we only require those in the subset $S_L$ that is to be recovered to be large. In addition, we believe that our analysis can be extended to cases when $\beta$ is not exactly sparse, but has entries decaying like a power law, for example, as studied by Candes and Tao (2007). We end with Proposition 1.5. For a set $A$, we use $|A|$ to denote its cardinality.

Proposition 1.5. Let $T_0$ denote positions of the $s_0$ largest coefficients of $\beta$ in absolute values, where $s_0$ is defined in (1.14). Let $a_0 = |S_L|$ (cf. (6.1)). Then $\forall c' > 1/2$, we have $\left|\{j \in T_0^c : |\beta_j| \ge \sqrt{\log p/(c'n)}\,\sigma\}\right| \le (2c' - 1)(s_0 - a_0)$.



1.7 Previous work 

We briefly review related work in multi-step procedures and the role of sparsity for high-dimensional statistical inference. Before this work, the hard thresholding idea was shown in Candes and Tao (2007) (via the Gauss-Dantzig selector) as a method to correct the bias of the initial Dantzig selector. The empirical success of the Gauss-Dantzig selector in terms of improving the statistical accuracy is strongly evident in their experimental results. Our theoretical analysis on the oracle inequalities, which hold for the Gauss-Dantzig selector under a uniform uncertainty principle, builds upon their theoretical analysis of the initial Dantzig selector under the same condition. For the Lasso, Meinshausen and Yu (2009) have also shown in theoretical analysis that thresholding is effective in obtaining a two-step estimator $\hat\beta$ that is consistent in its support with $\beta$ when $\beta_{\min}$ is sufficiently large. As pointed out by Bickel et al. (2009), a weakening of their condition is still sufficient for Assumption 1.1 to hold.

The sparse recovery problem under arbitrary noise is also well studied; see Candes et al. (2006); Needell and Tropp (2008); Needell and Vershynin (2009). Although, as argued in Candes et al. (2006) and Needell and Tropp (2008), the best accuracy under arbitrary noise has essentially been achieved in both works, their bounds are worse than that in Candes and Tao (2007) (hence the present paper) under the stochastic noise as discussed in the present paper. Moreover, greedy algorithms in Needell and Tropp (2008); Needell and Vershynin (2009) require $s$ to be part of the input, while algorithms in the present paper do not have such a requirement, and hence adapt well to the unknown level of sparsity. A more general framework on multi-step variable selection was studied by Wasserman and Roeder (2009). They control the probability of false positives at the price of false negatives, similar to what we aim for here; their analysis is constrained to the case when $s$ is a constant. Recently, another two-stage procedure that is also relevant has been proposed in Zhang (2009), where in the second stage "selective penalization" is applied to the set of irrelevant features, which are defined as those below a certain threshold in the initial Lasso estimator; incoherence conditions there are sufficiently different from the RE condition as we study in this paper for the Thresholded Lasso. Under conditions similar to Theorem 1.1, Zhou et al. (2009) requires $s = O(\sqrt{n/\log p})$ in order to achieve variable selection consistency using the adaptive Lasso (Zou, 2006) (see also Huang et al. (2008)) as the second step procedure. Concurrent with the present work, the authors have revisited the adaptive Lasso and derived bounds in terms of prediction error (van de Geer et al., 2010); there the number of false positives is also aimed at being in the same order as that of the set of significant variables which predicts $X\beta$ well; in addition, the adaptive Lasso method is compared with thresholding methods, under a stronger incoherence condition than the RE condition studied in the present paper. While the focus of the present paper is on variable selection and oracle inequalities for the $\ell_2$ loss, prediction errors of the OLS estimators $\hat\beta$ are also explicitly derived. We also compare the performance in terms of variable selection between the adaptive and the thresholding methods in our simulation study, which is reported in Section 7.

Parts of this work were presented in a conference paper, Zhou (2009b). The current version expands the original idea and elaborates upon the conceptual connections between the Thresholded Lasso and $\ell_0$ penalized methods; in addition, we provide new results on the sparse oracle inequalities under the RE condition (cf. Theorem 1.3, Theorem 5.1 and Theorem 6.3).

1.8 Organization of the paper 

Section 2 briefly discusses the relationship between linear sparsity and random design matrices, while highlighting the role thresholding plays in terms of recovering the best subset of variables, when $s$ is a linear fraction of $n$, which in turn is a nonnegligible fraction of $p$. We prove Theorem 1.1 essentially in Section 3. A thresholding framework for the general setting is described in Section 4, which also sketches the proof of Theorem 1.2. The proof of Theorem 1.3 is shown in Section 5, where oracle inequalities for the original Lasso estimator are also shown. In Section 6, we show conditions under which one recovers a subset of strong signals. Section 7 includes simulation results showing that the Thresholded Lasso is consistent with our theoretical analysis on variable selection and on estimating $\beta$. Most of the technical proofs are included in the Appendix.

2 Linear sparsity and random matrices 

Random design matrices are a special case of design matrices that satisfy the Restricted Eigenvalue assumption. This is shown in a large body of work, for example Baraniuk et al. (2008); Candes et al. (2006); Candes and Tao (2005, 2007); Donoho (2006b); Mendelson et al. (2008); Szarek (1991), which shows that the UUP holds for "generic" or random design matrices for very significant values of $s$. It is well known that for a random matrix the UUP holds for $s \asymp n/\log(p/n)$ with i.i.d. Gaussian random variables, subject to normalizations of columns, the Bernoulli, and in general the subgaussian random ensembles (Baraniuk et al., 2008; Mendelson et al., 2008). Adamczak et al. (2009) show that the UUP holds for $s \asymp n/\log^2(p/n)$ when $X$ is a random matrix composed of columns that are independent isotropic vectors with log-concave densities. Hence this setup only requires $C$ observations per nonzero value in $\beta$ (that is, $n \ge Cs$), where $C$ is a small constant, when $n$ is a nonnegligible fraction of $p$, in order to recover $\beta$; we call this level of sparsity the linear sparsity. Our simulation results in Section 7 show that once $n \ge Cs\log(p/n)$, where $C$ is a small constant, the exact recovery rate of the sparsity pattern is very high for Gaussian (and Bernoulli) random ensembles, when $\beta_{\min}$ is sufficiently large; this shows a strong contrast with the ordinary Lasso, for which the probability of success in terms of exact recovery of the sparsity pattern tends to zero when $n < 2s\log(p - s)$ (Wainwright, 2009b).

A series of recent papers (Raskutti et al., 2009; Zhou, 2009a; Zhou et al., 2009) show that a broader class of subgaussian random matrices also satisfy the Restricted Eigenvalue condition. In particular, Zhou (2009a) shows that for subgaussian random matrices $\Psi$, which are now well known to satisfy the UUP condition under linear sparsity, the RE condition holds for $X := \Psi\Sigma^{1/2}$ with overwhelming probability with $n \asymp s\log(p/n)$ samples, where $\Sigma$ is assumed to satisfy the following condition: Suppose $\Sigma_{jj} = 1$, $\forall j = 1, \ldots, p$, and for some integer $1 \le s \le p$ and a positive number $k_0$, the following condition holds for all $\upsilon \ne 0$:

$$\frac{1}{K(s, k_0, \Sigma)} := \min_{\substack{J_0 \subseteq \{1, \ldots, p\}, \\ |J_0| \le s}} \ \min_{\|\upsilon_{J_0^c}\|_1 \le k_0\|\upsilon_{J_0}\|_1} \frac{\|\Sigma^{1/2}\upsilon\|_2}{\|\upsilon_{J_0}\|_2} > 0.$$

Thus the additional covariance structure $\Sigma$ is explicitly introduced to the columns of $\Psi$ in generating $X$. We believe similar results can be extended to other cases: for example, when $X$ is the composition of a random Fourier ensemble, or randomly sampled rows of orthonormal matrices, see for example Candes and Tao (2006, 2007); Rudelson and Vershynin (2006), where the UUP holds for $s = O(n/\log^c p)$ for some constant $c > 0$.



3 Thresholding procedure when $\beta_{\min}$ is large

In this section, we use a penalization parameter $\lambda_n \ge B\lambda_{\sigma,a,p}$ and assume
$\beta_{\min} > C\lambda_n\sqrt{s}$ for some constants $B$, $C$; we first specify the thresholding parameters in this case.
We then show in Theorem 3.1 that our algorithm works under any condition so long as the rate of convergence
of the initial estimator obeys the bounds in (3.2). Theorem 1.1 is a corollary of Theorem 3.1 under
Assumption 1.1, given the rate of convergence bounds for $\beta_{\mathrm{init}}$ following derivations in Bickel et al. (2009).

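The Iterative Procedure. We obtain an initial estimator $\beta_{\mathrm{init}}$ using the Lasso or the Dantzig selector. Let
$S_0 = \{j : |\beta_{j,\mathrm{init}}| > 4\lambda_n\}$ and $\hat\beta^{(0)} := \beta_{\mathrm{init}}$. Iterate through the following steps twice, for $i = 0, 1$:
(a) set $t_i = 4\lambda_n\sqrt{|S_i|}$; (b) threshold $\hat\beta^{(i)}$ with $t_i$ to obtain $I := S_{i+1}$, where

$$S_{i+1} = \left\{ j \in S_i : |\hat\beta^{(i)}_j| \ge 4\lambda_n\sqrt{|S_i|} \right\} \qquad (3.1)$$

and compute $\hat\beta^{(i+1)}_I = (X_I^T X_I)^{-1} X_I^T Y$. Return the final set of variables in $S_2$ and output $\hat\beta$ such that
$\hat\beta_{S_2} = \hat\beta^{(2)}_{S_2}$ and $\hat\beta_j = 0$, $\forall j \in S_2^c$.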
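The Iterative Procedure can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used for the experiments: the function names `ols_on_support` and `iterative_threshold` are hypothetical, the initial estimator `beta_init` is taken as given (from any Lasso or Dantzig selector solver), and coefficients are compared in absolute value.

```python
import numpy as np

def ols_on_support(X, Y, support):
    """OLS fit restricted to the columns in `support`; all other entries are zero."""
    beta = np.zeros(X.shape[1])
    if len(support) > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], Y, rcond=None)
        beta[support] = coef
    return beta

def iterative_threshold(X, Y, beta_init, lam_n, n_iter=2):
    """Threshold at t_i = 4*lam_n*sqrt(|S_i|), refit by OLS, and repeat n_iter times."""
    beta = np.asarray(beta_init, dtype=float).copy()
    support = np.flatnonzero(np.abs(beta) > 4.0 * lam_n)   # S_0
    for _ in range(n_iter):
        t_i = 4.0 * lam_n * np.sqrt(len(support))           # step (a)
        support = support[np.abs(beta[support]) >= t_i]     # step (b), cf. (3.1)
        beta = ols_on_support(X, Y, support)                # OLS refit on S_{i+1}
    return beta, support
```

Because each pass can only shrink the support, the returned sets are nested, which mirrors the inclusion $S_2 \subseteq S_1$ in Theorem 3.1.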

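Theorem 3.1. Let $\lambda_n \ge B\lambda_{\sigma,a,p}$, where $B \ge 1$ is a constant suitably chosen such that the initial estimator
$\beta_{\mathrm{init}}$ satisfies on some event $\mathcal{Q}_b$, for $\upsilon_{\mathrm{init}} = \beta_{\mathrm{init}} - \beta$,

$$\|\upsilon_{\mathrm{init},S}\|_2 \le B_0\lambda_n\sqrt{s} \quad\text{and}\quad \|\upsilon_{\mathrm{init},S^c}\|_1 \le B_1\lambda_n s \qquad (3.2)$$

where $B_0$, $B_1$ are some constants. Suppose for $B_2 = 1/(B\Lambda_{\min}(2s))$,

$$\beta_{\min} \ge \left(\max(\sqrt{B_1}, 2)\,2\sqrt{2} + \max(B_0, \sqrt{2}B_2)\right)\lambda_n\sqrt{s}. \qquad (3.3)$$

Then for $s \ge B_1^2/16$, it holds on $\mathcal{T}_a \cap \mathcal{Q}_b$ that, (a): $\forall i = 1, 2$, $|S_i| \le 2s$; and (b):

$$\|\hat\beta^{(i)}_{S_i} - \beta_{S_i}\|_2 \le \lambda_{\sigma,a,p}\sqrt{|S_i|}/\Lambda_{\min}(|S_i|) \le \lambda_n B_2\sqrt{2s} \qquad (3.4)$$

where $\hat\beta^{(i)}$, $\forall i = 1, 2$, are the OLS estimators based on $S_i$. Moreover, the Iterative Procedure includes the
set of relevant variables in $S_2$ such that $S \subseteq S_2 \subseteq S_1$ and

$$|S_2 \setminus S| := |\mathrm{supp}(\hat\beta) \setminus S| \le \frac{1}{16B^2\Lambda_{\min}^2(|S_1|)} \le B_2^2/16. \qquad (3.5)$$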



The proof of Theorem 3.1 appears in Section D. We now discuss its relationship to theorems in the subsequent
sections. We first note that in order to obtain $S_1$ such that $|S_1| \le 2s$ and $S_1 \supseteq S$ as above, we
only need to threshold $\beta_{\mathrm{init}}$ at $t_0 = B_1\lambda_n$; here instead of having to estimate the unknown $B_1$, we can use
$t_0 = c_0\lambda_n\sqrt{s}$ for some constant $c_0$ to threshold $\beta_{\mathrm{init}}$. In the general setting, we require that $t_0$ be chosen
from the range $(C_1\lambda_n, C_4\lambda_n]$ for some constants $C_1$, $C_4$ to be specified; see Section 4 (Lemma 4.2) for
example. We note that without the knowledge of $\sigma$, one could use $\hat\sigma \ge \sigma$ in $\lambda_n$; this will put a stronger
requirement on $\beta_{\min}$, but all conclusions of Theorem 3.1 hold. When $\beta_{\min}$ does not satisfy the constraint as in
Theorem 3.1, we cannot really guarantee that all variables in $S$ will be chosen. Hence (3.2) will be replaced
by requirements on $T_0$, which denotes locations of the $s_0$ largest coefficients of $\beta$ in absolute values: ideally,
we wish to have

$$\|(\beta_{\mathrm{init}} - \beta)_{T_0}\|_2 \le C_0\lambda_n\sqrt{|T_0|} \quad\text{and}\quad \|\beta_{\mathrm{init},T_0^c}\|_1 \le C_1\lambda_n|T_0| \qquad (3.6)$$

for some constants $C_0$, $C_1$, so that (1.16) and (1.17) hold under suitably chosen thresholding rules. This is
the content of Theorem 5.1 and Theorem 6.3.



4 Nearly ideal model selections under the UUP 

In this section, we wish to derive a meaningful criterion for consistency in variable selection, when $\beta_{\min}$
is well below the noise level. Suppose that we are given an initial estimator $\beta_{\mathrm{init}}$ that achieves the oracle
inequality as in (1.12), which adapts nearly ideally not only to the uncertainty in the support set $S$ but also
the "significant" set. Although we cannot guarantee that the variables indexed by
$S_R = \{j : |\beta_j| < \sigma\sqrt{2\log p/n}\}$ are included in the final set $I$ (cf. (4.7)) due to their lack of strength, we
wish to include in $I$ most variables in $S_L = S \setminus S_R$ such that the OLS estimator based on $I$ achieves (1.12)
even though some non-zero variables are missing from $I$. Here we pay a price for the missing variables in
order to obtain a sufficiently sparse model $I$. Toward this goal, we analyze the following algorithm.

The General Two-step Procedure: Assume $\delta_{2s} + \theta_{s,2s} < 1 - \tau$, where $\tau > 0$;

1. First obtain an initial estimator $\beta_{\mathrm{init}}$ using the Dantzig selector in (1.3) with
$\lambda_n = (\sqrt{1+a} + \tau^{-1})\sqrt{2\log p/n}\,\sigma$, where $a \ge 0$; then threshold $\beta_{\mathrm{init}}$ with $t_0$, chosen from the range
$(C_1\lambda_{p,\tau}\sigma, C_4\lambda_{p,\tau}\sigma]$, for $C_1$ as defined in (4.2), to obtain a set $I$ of cardinality at most $2s_0$ (cf. Lemma 4.2):
set $I := \{j \in \{1, \ldots, p\} : |\beta_{j,\mathrm{init}}| > t_0\}$.

2. Given a set $I$ as above, run the OLS regression to obtain $\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$ and set $\hat\beta_j = 0$, $\forall j \notin I$.

In Section 5, we analyze the Thresholded Lasso, where we obtain $\beta_{\mathrm{init}}$ via the Lasso under the RE condition
and follow the same steps as above; see Theorem 5.1 and Lemma 5.2 for the new $\lambda_n$ and $t_0$ to be specified.
Under the UUP, Candes and Tao (2007) have shown that the Dantzig selector achieves nearly the ideal level
of $\ell_2$ loss. We then show in Lemma 4.2 that thresholding at the level of $C_1\lambda\sigma$ at Step 1 selects a set $I$ of at
most $2s_0$ variables, among which at most $s_0$ are from $S^c$.

Proposition 4.1. (Candes and Tao, 2007) Let $Y = X\beta + \epsilon$, for $\epsilon$ being i.i.d. $N(0, \sigma^2)$ and $\|X_j\|_2^2 = n$.
Choose $\tau, a > 0$ and set $\lambda_n = (\sqrt{1+a} + \tau^{-1})\sigma\sqrt{2\log p/n}$ in (1.3). Then if $\beta$ is $s$-sparse with
$\delta_{2s} + \theta_{s,2s} < 1 - \tau$, the Dantzig selector obeys with probability at least $1 - (\sqrt{\pi\log p}\,p^a)^{-1}$

$$\|\hat\beta - \beta\|_2^2 \le 2C_2^2(\sqrt{1+a} + \tau^{-1})^2\log p\left(\sigma^2/n + \sum_{i=1}^p \min(\beta_i^2, \sigma^2/n)\right).$$

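From this point on we let $\delta := \delta_{2s}$ and $\theta := \theta_{s,2s}$. Analysis in Candes and Tao (2007) (Theorem 2) and the
current paper yields the following constants,

$$C_2 = 2C_0' + \frac{1+\delta}{1-\delta-\theta} \quad\text{where}\quad C_0' = \frac{C_0}{1-\delta-\theta} + \frac{\theta(1+\delta)}{(1-\delta-\theta)^2}, \qquad (4.1)$$

where $C_0 = 2\sqrt{2}\left(1 + \frac{1-\delta^2}{1-\delta-\theta}\right) + (1 + 1/\sqrt{2})\frac{(1+\delta)^2}{1-\delta-\theta}$. We now define

$$C_1 = C_0' + \frac{1+\delta}{1-\delta-\theta} \quad\text{and} \qquad (4.2)$$

$$C_3^2 = 3(\sqrt{1+a} + \tau^{-1})^2((C_0' + C_4)^2 + 1) + 4(1+a)/\Lambda_{\min}^2(2s_0) \qquad (4.3)$$

where $C_3$ has not been optimized. Recall that $s_0$ is the smallest integer such that
$\sum_{i=1}^p \min(\beta_i^2, \lambda^2\sigma^2) \le s_0\lambda^2\sigma^2$, where $\lambda = \sqrt{2\log p/n}$. We order the $\beta_j$'s in decreasing order of magnitude

$$|\beta_1| \ge |\beta_2| \ge \cdots \ge |\beta_p|. \qquad (4.4)$$

Thus by definition of $s_0$ and the fact $0 \le s_0 \le s$, we have for $s < p$,

$$s_0\lambda^2\sigma^2 \le \lambda^2\sigma^2 + \sum_{j=1}^p \min(\beta_j^2, \lambda^2\sigma^2) \le 2\log p\left(\frac{\sigma^2}{n} + \sum_{i=1}^p \min\left(\beta_i^2, \frac{\sigma^2}{n}\right)\right) \qquad (4.5)$$

$$s_0\lambda^2\sigma^2 \ge \sum_{j=1}^{s_0+1} \min(\beta_j^2, \lambda^2\sigma^2) \ge (s_0 + 1)\min(\beta_{s_0+1}^2, \lambda^2\sigma^2) \qquad (4.6)$$

which implies (as shown in Candes and Tao (2007)) that $\min(\beta_{s_0+1}^2, \lambda^2\sigma^2) < \lambda^2\sigma^2$ and hence by (4.4),
it holds that

$$|\beta_j| < \lambda\sigma \quad\text{for all } j > s_0. \qquad (4.7)$$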
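The definition of $s_0$ is easy to evaluate numerically when $\beta$ is known, as it is in simulations. The following sketch uses only the standard library; the function name `smallest_s0` is hypothetical and the code simply searches for the smallest integer satisfying the defining inequality with $\lambda = \sqrt{2\log p/n}$.

```python
import math

def smallest_s0(beta, sigma, n, p):
    """Smallest integer s0 with sum_i min(beta_i^2, lam^2 sigma^2) <= s0 * lam^2 sigma^2."""
    unit = 2.0 * math.log(p) / n * sigma ** 2        # lam^2 * sigma^2
    total = sum(min(b * b, unit) for b in beta)
    s0 = 0
    while total > s0 * unit:                          # linear search; s0 never exceeds len(beta)
        s0 += 1
    return s0
```

Coefficients above the level $\lambda\sigma$ each contribute one full unit to the sum, while coefficients below it contribute only their squared size, so $s_0$ counts the strong coordinates plus the rounded-up mass of the weak ones.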

Lemma 4.2. Choose $\tau > 0$ such that $\delta_{2s} + \theta_{s,2s} < 1 - \tau$. Let $\beta_{\mathrm{init}}$ be the solution to (1.3) with
$\lambda_n = \lambda_{p,\tau}\sigma := (\sqrt{1+a} + \tau^{-1})\sqrt{2\log p/n}\,\sigma$. Given some constant $C_4 \ge C_1$, for $C_1$ as in (4.2), choose a
thresholding parameter $t_0$ such that $C_4\lambda_{p,\tau}\sigma \ge t_0 > C_1\lambda_{p,\tau}\sigma$ and set $I = \{j : |\beta_{j,\mathrm{init}}| > t_0\}$. Then
with probability at least $P(\mathcal{T}_a)$, as detailed in Proposition 4.1, we have (1.16), and for $C_0'$ as in (4.1),
$\|\beta_{\mathcal{D}}\|_2 \le \sqrt{(C_0' + C_4)^2 + 1}\,\lambda_{p,\tau}\sigma\sqrt{s_0}$, where $\mathcal{D} := \{1, \ldots, p\} \setminus I$.

It is clear by Lemma 4.2 that we cannot cut too many "significant" variables; in particular, for those that are
$> \lambda\sigma\sqrt{s_0}$, we can cut at most a constant number of them. Next we show that even if we miss some columns
of $X$ in $S$, we can still hope to get the $\ell_2$ loss as required in Theorem 1.2 so long as $\|\beta_{\mathcal{D}}\|_2$ is bounded,
for example, as bounded in Lemma 4.2, and $I$ is sufficiently sparse. Now Theorem 1.2 is an immediate
corollary of Lemma 4.2 and 4.3 in view of (4.5). See Section E for its proof. We note that Lemma 4.3 yields
a general result on the $\ell_2$ loss for the OLS estimator, when a subset of relevant variables is missing from the
chosen model $I$; this is also an important technical contribution of this paper.

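Lemma 4.3. (OLS estimator with missing variables) Suppose that (1.4) and (1.5) hold. Let
$\mathcal{D} := \{1, \ldots, p\} \setminus I$ and $S_{\mathcal{D}} = \mathcal{D} \cap S$ such that $I \cap S_{\mathcal{D}} = \emptyset$. Suppose $|I \cup S_{\mathcal{D}}| \le 2s$. Then, for
$\hat\beta_I = (X_I^T X_I)^{-1} X_I^T Y$, it holds on $\mathcal{T}_a$ that

$$\|\hat\beta_I - \beta\|_2^2 \le \left(\frac{\theta_{|I|,|S_{\mathcal{D}}|}\|\beta_{\mathcal{D}}\|_2 + \lambda_{\sigma,a,p}\sqrt{|I|}}{\Lambda_{\min}(|I|)}\right)^2 + \|\beta_{\mathcal{D}}\|_2^2.$$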



We note that Lemma 4.3 applies to $X$ so long as conditions (1.4) and (1.5) hold, which guarantees that
$\theta_{|I|,|S_{\mathcal{D}}|}$ is bounded within a reasonable constant, when $|I| + |S_{\mathcal{D}}| \le 2s$ (cf. Lemma 5.4). It is clear from
Lemma 4.3 and Theorem 1.4 that, except for the constants that appear before each term, namely, $\|\beta_{\mathcal{D}}\|_2$
and $\sqrt{|I|}\sqrt{2\log p}\,\sigma$, the bias and variance tradeoffs for the prediction error and the $\ell_2$ loss follow roughly
the same trend in their upper bounds. It will make sense to take a look at the bound on prediction error
for the Gauss-Dantzig selector stated in Corollary 4.4, which follows immediately from Theorem 1.4 and
Lemma 4.2.

Corollary 4.4. Under conditions in Theorem 1.2, the Gauss-Dantzig selector chooses $I$, where $|I| \le 2s_0$,
such that for the OLS estimator $\hat\beta_I$ based on $I$, we have $\|X\hat\beta_I - X\beta\|_2/\sqrt{n} \le C_5\sqrt{s_0}\,\lambda\sigma$, where
$C_5 = \sqrt{\Lambda_{\max}(s)}\left(\sqrt{(C_0' + C_4)^2 + 1}\,(\sqrt{1+a} + \tau^{-1})\right) + f(I)$, where $f(I) := \sqrt{2(1+a)\Lambda_{\max}(|I|)/\Lambda_{\min}(|I|)}$.



5 On sparse oracle inequalities of the Lasso under the RE condition 

In this section, in order to prove Theorem 1.3, we first show in Theorem 5.1 that under the RE condition,
the Lasso estimator achieves essentially the same type of oracle properties as the Dantzig selector (under
UUP). This result is new to the best of our knowledge; it improves upon a result in Bickel et al. (2009) (cf.
Theorem 7.2) under slightly different RE conditions, and thus may be of independent interest. The sparse
oracle properties of the Thresholded Lasso in terms of variable selection, $\ell_2$ loss, and prediction error then
all follow naturally from Theorem 5.1, Lemma 5.2 and Lemma 4.3 as derived in Section 4. The proof of
Theorem 5.1 draws upon techniques from a concurrent work in van de Geer et al. (2010), where a stronger
condition is required, while deriving bounds similar to the present paper.

Theorem 5.1. (Oracle inequalities of the Lasso) Let $Y = X\beta + \epsilon$, for $\epsilon$ being i.i.d. $N(0, \sigma^2)$ and
$\|X_j\|_2 = \sqrt{n}$. Let $s_0$ be as in (1.14) and $T_0$ denote locations of the $s_0$ largest coefficients of $\beta$ in absolute
values. Suppose that $RE(s_0, 6, X)$ holds with $K(s_0, 6)$, and (1.4) and (1.5) hold. Let $\beta_{\mathrm{init}}$ be an optimal
solution to (1.2) with $\lambda_n = d_0\lambda\sigma \ge 2\lambda_{\sigma,a,p}$, where $a \ge 0$ and $d_0 \ge 2\sqrt{1+a}$. Let $h = \beta_{\mathrm{init}} - \beta_{T_0}$. Then on
$\mathcal{T}_a$ as in (1.15), we have for $\Lambda_{\max} := \Lambda_{\max}(s - s_0)$,

$$\|\beta_{\mathrm{init}} - \beta\|_2^2 \le 2\lambda^2\sigma^2 s_0(D_0^2 + D_1^2 + 1),$$

\\ h T \\i + ||#iiit,T c || 1 < + max 1 8iC 2 (s , 6)(io,^^|^ \as , 

WXftrit-XpWs/y/fi < A(r^(0W + 3doif(s o ,6)) 

where $D_0$, $D_1$ are defined in (5.2) and (5.3). Moreover, for any subset $I_0 \subset S$, by assuming that $RE(|I_0|, 6, X)$
holds with $K(|I_0|, 6)$, we have

$$\|X\beta_{\mathrm{init}} - X\beta\|_2^2/n \le 2\|X\beta - X\beta_{I_0}\|_2^2/n + 9\lambda_n^2|I_0|K^2(|I_0|, 6). \qquad (5.1)$$

Let $T_1$ denote the $s_0$ largest positions of $h$ in absolute values outside of $T_0$; let $T_{01} := T_0 \cup T_1$. The proof
of Theorem 5.1 yields the following bounds: for $K := K(s_0, 6)$, $\|h_{T_{01}}\|_2 \le D_0\lambda\sigma\sqrt{s_0}$ and
$\|h_{T_0^c}\|_1 \le D_1\lambda\sigma s_0$ where

$$D_0 = \max\{D,\ K\sqrt{2}\,(2\sqrt{\Lambda_{\max}(s - s_0)} + 3d_0K)\}, \qquad (5.2)$$



where D = (a/2 + ij ^jjEg 



- sn,2sn-A-max(s So) , 

H r ; ; and 



^A min (2s ) A min (2s ) 
$$D_1 = 2\Lambda_{\max}(s - s_0)/d_0 + 9K^2d_0/2. \qquad (5.3)$$

The proof of Lemma 5.2 follows exactly that of Lemma 4.2, and is hence omitted. We then state the bound
on prediction error for $\hat\beta_I$ for the Thresholded Lasso, which follows immediately from Theorem 1.4 and
Lemma 5.2.

Lemma 5.2. Suppose that $X$ obeys $RE(s_0, 6, X)$, and conditions (1.4) and (1.5) hold. Let $\beta_{\mathrm{init}}$ be an
optimal solution to (1.2) with $\lambda_n = d_0\lambda\sigma \ge 2\lambda_{\sigma,a,p}$, where $a \ge 0$, $d_0 \ge 2\sqrt{1+a}$, and $\lambda := \sqrt{2\log p/n}$ as
in Theorem 5.1. Suppose that we choose $t_0 = C_4\lambda\sigma$ for some positive constant $C_4$. Let
$I = \{j : |\beta_{j,\mathrm{init}}| \ge t_0\}$ and $\mathcal{D} := \{1, \ldots, p\} \setminus I$. Then we have on $\mathcal{T}_a$,

$$|I| \le s_0(1 + D_1/C_4) \quad\text{and}\quad |I \cup S| \le s + D_1s_0/C_4 \quad\text{and}$$

$$\|\beta_{\mathcal{D}}\|_2 \le \sqrt{(D_0 + C_4)^2 + 1}\,\lambda\sigma\sqrt{s_0},$$

where $D_0$, $D_1$ are as defined in (5.2) and (5.3).

Corollary 5.3. Under conditions in Theorem 1.3, the Thresholded Lasso chooses $I$, where $|I| \le 2s_0$,
such that for the OLS estimator $\hat\beta_I$ based on $I$, it holds that
$\|X\hat\beta_I - X\beta\|_2/\sqrt{n} \le C_6\sqrt{s_0}\,\lambda\sigma$, where
$C_6 = \sqrt{\Lambda_{\max}(s)}\sqrt{(D_0 + C_4)^2 + 1} + f(I)$, for $f(I)$ as defined in Corollary 4.4 and $D_0$ is defined in (5.2).

We now state Lemma 5.4, which follows from Candes and Tao (2005) (Lemma 1.2); we then prove
Theorem 1.3, where we give an explicit expression for $D_3$.

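Lemma 5.4. (Candes and Tao, 2005) Suppose that (1.4) and (1.5) hold. Then for all disjoint sets
$I, S_{\mathcal{D}} \subseteq \{1, \ldots, p\}$ of cardinality $|S_{\mathcal{D}}| < s$ and $|I| + |S_{\mathcal{D}}| \le 2s$,

$$\theta_{|I|,|S_{\mathcal{D}}|} \le (\Lambda_{\max}(2s) - \Lambda_{\min}(2s))/2;$$

In particular, if $\delta_{2s} < 1$, we have $\theta_{|I|,|S_{\mathcal{D}}|} \le \delta_{|I|+|S_{\mathcal{D}}|} \le \delta_{2s} < 1$.

Proof of Theorem 1.3. It holds by definition of $S_{\mathcal{D}}$ that $I \cap S_{\mathcal{D}} = \emptyset$. It is clear by Lemma 5.2 that for
$C_4 \ge D_1$, $|I| \le 2s_0$ and $|I \cup S_{\mathcal{D}}| \le |I \cup S| \le s + s_0 \le 2s$, given that $|S_{\mathcal{D}}| < s$. We have by Lemma 4.3

$$\|\hat\beta_I - \beta\|_2^2 \le \|\beta_{\mathcal{D}}\|_2^2\left(1 + \frac{2\theta_{|I|,|S_{\mathcal{D}}|}^2}{\Lambda_{\min}^2(|I|)}\right) + \frac{2|I|\lambda_{\sigma,a,p}^2}{\Lambda_{\min}^2(|I|)}$$

$$\le D_3^2\lambda^2\sigma^2 s_0 \le 2D_3^2\log p\left(\sigma^2/n + \sum_{i=1}^p \min(\beta_i^2, \sigma^2/n)\right)$$

where

$$D_3^2 = ((D_0 + C_4)^2 + 1)\left(1 + \frac{2\theta_{|I|,|S_{\mathcal{D}}|}^2}{\Lambda_{\min}^2(|I|)}\right) + 4(1+a)/\Lambda_{\min}^2(|I|).$$

□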
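For readers tracing the constants, the chain of inequalities in this proof can be spelled out as a short worked derivation. It is a sketch under two assumptions that are not restated in this section: the bound of Lemma 4.3 is read as a squared ratio plus $\|\beta_{\mathcal{D}}\|_2^2$, and $\lambda_{\sigma,a,p} = \sqrt{1+a}\,\lambda\sigma$, which is consistent with the condition $d_0 \ge 2\sqrt{1+a}$ in Theorem 5.1. Here $\theta$ abbreviates $\theta_{|I|,|S_{\mathcal{D}}|}$.

```latex
\begin{align*}
\|\hat\beta_I - \beta\|_2^2
 &\le \|\beta_{\mathcal{D}}\|_2^2
    + \frac{\bigl(\theta\,\|\beta_{\mathcal{D}}\|_2 + \lambda_{\sigma,a,p}\sqrt{|I|}\bigr)^2}{\Lambda_{\min}^2(|I|)}
    && \text{(Lemma 4.3)} \\
 &\le \|\beta_{\mathcal{D}}\|_2^2\Bigl(1 + \frac{2\theta^2}{\Lambda_{\min}^2(|I|)}\Bigr)
    + \frac{2|I|\,\lambda_{\sigma,a,p}^2}{\Lambda_{\min}^2(|I|)}
    && \text{since } (a+b)^2 \le 2a^2 + 2b^2 \\
 &\le \lambda^2\sigma^2 s_0\Bigl[\bigl((D_0 + C_4)^2 + 1\bigr)\Bigl(1 + \frac{2\theta^2}{\Lambda_{\min}^2(|I|)}\Bigr)
    + \frac{4(1+a)}{\Lambda_{\min}^2(|I|)}\Bigr]
    && \text{(Lemma 5.2 and } |I| \le 2s_0\text{)}
\end{align*}
```

The bracketed factor is the constant $D_3^2$, and the passage from $\lambda^2\sigma^2 s_0$ to the oracle quantity $2\log p\,(\sigma^2/n + \sum_i \min(\beta_i^2, \sigma^2/n))$ is exactly (4.5).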



It is clear by Lemma 5.4 that

$$D_3^2 \le ((D_0 + C_4)^2 + 1)\left(1 + \frac{(\Lambda_{\max}(2s) - \Lambda_{\min}(2s))^2}{2\Lambda_{\min}^2(|I|)}\right) + \frac{4(1+a)}{\Lambda_{\min}^2(|I|)}. \qquad (5.4)$$



6 Controlling Type-II errors 

In this section, we derive results that are parametrized based on the performance of an initial estimator,
the smallest magnitude of variables in $\{j : |\beta_j| > \lambda\sigma\}$, where $\lambda := \sqrt{2\log p/n}$, and the choice of the
thresholding parameter $t_0$. We emphasize that we do not necessarily require that $t_0 \ge \lambda\sigma$. We first introduce
some more notation. Again order the $\beta_j$'s in decreasing order of magnitude: $|\beta_1| \ge |\beta_2| \ge \cdots \ge |\beta_p|$. Let
$T_0 = \{1, \ldots, s_0\}$. In view of (4.7), we decompose $T_0 = \{1, \ldots, s_0\}$ into two sets: $A_0$ and $T_0 \setminus A_0$, where
$A_0$ contains the set of coefficients of $\beta$ strictly larger than $\lambda\sigma$, for which we define a constant:

$$A_0 = \{j : |\beta_j| > \lambda\sigma\} =: \{1, \ldots, a_0\}; \quad\text{let}\quad \beta_{\min,A_0} := \min_{j \in A_0}|\beta_j| > \lambda\sigma. \qquad (6.1)$$

Our goal is to show that when $\beta_{\min,A_0}$ is sufficiently large, we have $A_0 \subset I$ while achieving the sparse
oracle inequalities. This is shown in Theorem 6.3 under the RE condition, which is stated as a corollary of
Lemma 6.2. First note that changing the coefficients of $\beta_{A_0}$ will not change the values of $s_0$ or $a_0$, so long
as their absolute values stay strictly larger than $\lambda\sigma$. Thus one can increase $t_0$ as $\beta_{\min,A_0}$ increases in order
to reduce false positives while not increasing false negatives from the set $A_0$. In Lemma 6.2, we impose a
lower bound on $\beta_{\min,A_0}$ (6.4) in order to recover the subset of variables in $A_0$, while achieving the nearly
ideal $\ell_2$ loss with a sparse model $I$.

We now show in Lemma 6.1 that under no restriction on $\beta_{\min}$, we achieve an oracle bound on the $\ell_2$ loss,
which depends only on the $\ell_2$ loss of the initial estimator on the set $T_0$. Bounds in Lemma 4.2 and 5.2 are
special cases of (6.2) as we state now.

Lemma 6.1. Let $\beta_{\mathrm{init}}$ be an initial estimator. Let $h = \beta_{\mathrm{init}} - \beta_{T_0}$ and $\lambda := \sqrt{2\log p/n}$. Suppose that we
choose a thresholding parameter $t_0$ and set

$$I = \{j : |\beta_{j,\mathrm{init}}| \ge t_0\}.$$

Then for $\mathcal{D} := \{1, \ldots, p\} \setminus I$, we have for $\mathcal{D}_{11} := \mathcal{D} \cap A_0$ and $a_0 = |A_0|$,

$$\|\beta_{\mathcal{D}}\|_2^2 \le (s_0 - a_0)\lambda^2\sigma^2 + (t_0\sqrt{a_0} + \|h_{\mathcal{D}_{11}}\|_2)^2. \qquad (6.2)$$

Suppose that $t_0 < \beta_{\min,A_0}$ as defined in (6.1). Then (6.2) can be replaced by

$$\|\beta_{\mathcal{D}}\|_2^2 \le (s_0 - a_0)\lambda^2\sigma^2 + \|h_{\mathcal{D}_{11}}\|_2^2\,(\beta_{\min,A_0}/(\beta_{\min,A_0} - t_0))^2. \qquad (6.3)$$

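Lemma 6.2. (Oracle Ideal MSE with $\ell_\infty$ bounds) Suppose that (1.4) and (1.5) hold. Let $\beta_{\mathrm{init}}$ be an initial
estimator. Let $h = \beta_{\mathrm{init}} - \beta_{T_0}$ and $\lambda := \sqrt{2\log p/n}$. Suppose on some event $\mathcal{Q}_c$, for $\beta_{\min,A_0}$ as defined
in (6.1), it holds that

$$\beta_{\min,A_0} \ge \|h_{A_0}\|_\infty + \min\left\{(s_0)^{-1/2}\|\beta_{\mathrm{init},T_0^c}\|_2,\ (s_0)^{-1}\|\beta_{\mathrm{init},T_0^c}\|_1\right\}. \qquad (6.4)$$

Now we choose a thresholding parameter $t_0$ such that on $\mathcal{Q}_c$, for some $\breve{s}_0 \ge s_0$,

$$\beta_{\min,A_0} - \|h_{A_0}\|_\infty \ge t_0 > \min\left\{(\breve{s}_0)^{-1/2}\|\beta_{\mathrm{init},T_0^c}\|_2,\ (\breve{s}_0)^{-1}\|\beta_{\mathrm{init},T_0^c}\|_1\right\} \qquad (6.5)$$

holds and set $I = \{j : |\beta_{j,\mathrm{init}}| \ge t_0\}$. Then we have on $\mathcal{T}_a \cap \mathcal{Q}_c$,

$$A_0 \subset I \quad\text{and}\quad |I \cap T_0^c| \le \breve{s}_0; \quad\text{and hence}\quad |I| \le s_0 + \breve{s}_0; \qquad (6.6)$$

$$\text{and}\quad \|\beta_{\mathcal{D}}\|_2^2 \le (s_0 - a_0)\lambda^2\sigma^2. \qquad (6.7)$$

For $\hat\beta_I$ being the OLS estimator based on $(X_I, Y)$ and $\breve{s}_0 \le s$, we have on $\mathcal{T}_a \cap \mathcal{Q}_c$,

$$\|\hat\beta_I - \beta\|_2^2 \le C_7\breve{s}_0\lambda^2\sigma^2/\Lambda_{\min}^2(|I|) \qquad (6.8)$$

where $C_7$ depends on $\theta_{|I|,|S_{\mathcal{D}}|}$, which is upper bounded by $(\Lambda_{\max}(2s) - \Lambda_{\min}(2s))/2$.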



By introducing $\breve{s}_0$, the dependency of $t_0$ on the knowledge of $s_0$ is relaxed; in particular, it can be used to
express a desirable level of sparsity for the model $I$ that one wishes to select. We note that implicit in the
statement of Lemma 6.2, we assume the knowledge of the bounds on various norms of $\beta_{\mathrm{init}} - \beta$ (hence the
name of "oracle"). Theorem 6.3 is an immediate corollary of Lemma 6.2, with the difference being: we now
let $\breve{s}_0 = s_0$ everywhere and assume having an upper estimate $\breve{D}_1$ of $D_1$, so as not to depend on an "oracle"
telling us an exact value.

Theorem 6.3. Suppose that the $RE(s_0, 6, X)$ condition holds. Choose $\lambda_n \ge b\lambda_{\sigma,a,p}$, where $b \ge 2$. Let $\beta_{\mathrm{init}}$
be the Lasso estimator as in (1.2). Suppose that for some constants $\breve{D}_1 \ge D_1$, and for $D_0$, $D_1$ as in (5.2)
and (5.3), it holds that

$$\beta_{\min,A_0} \ge D_0\lambda\sigma\sqrt{s_0} + \breve{D}_1\lambda\sigma, \quad\text{where } \lambda := \sqrt{2\log p/n}.$$

Choose a thresholding parameter $t_0$ and set

$$I = \{j : |\beta_{j,\mathrm{init}}| \ge t_0\}, \quad\text{where } t_0 \ge \breve{D}_1\lambda\sigma.$$

Then on $\mathcal{T}_a$, (6.6), (6.7), and (6.8) all hold with $\breve{s}_0 = s_0$ everywhere and
$C_7 \le \Lambda_{\min}^2(|I|) + (\Lambda_{\max}(2s) - \Lambda_{\min}(2s))^2/2 + 4(1+a)$. Moreover, the OLS estimator $\hat\beta_I$ based on $I$ achieves
on $\mathcal{T}_a$, for $f(I)$ as defined in Corollary 4.4, where $|I| \le 2s_0$,

$$\|X\hat\beta_I - X\beta\|_2/\sqrt{n} \le C_8\sqrt{s_0}\,\lambda\sigma \quad\text{where}\quad C_8 = \sqrt{\Lambda_{\max}(s)} + f(I).$$

Figure 1: Illustrative example: i.i.d. Gaussian ensemble; $p = 256$, $n = 72$, $s = 8$, and $\sigma = \sqrt{s}/3$. (a)
Compare with the Lasso estimator $\tilde\beta$ which minimizes $\ell_2$ loss. Here $\tilde\beta$ has only 3 FPs, but $\rho^2$ is large with a
value of 64.73. (b) Compare with the $\beta_{\mathrm{init}}$ obtained using $\lambda_n$. The dotted lines show the thresholding level
$t_0$. The $\beta_{\mathrm{init}}$ has 15 FPs, all of which were cut after the 1st step, resulting in $\rho^2 = 12.73$. After refitting with
OLS in the 2nd step, for the $\hat\beta$, $\rho^2$ is further reduced to 0.51.



6.1 Discussions 



Compared to Theorem 1.1, we now put a lower bound on $\beta_{\min,A_0}$ rather than on the entire set $S$ in
Theorem 6.3, with the hope to recover $A_0$. Choosing the set $A_0$ is rather arbitrary; one could, for example,
consider the set of variables that are strictly above $\lambda\sigma/2$. Bounds on $\|h_{A_0}\|_\infty$ are in general
harder to obtain than $\|h_{A_0}\|_2$; under stronger incoherence conditions, such bounds can be obtained; see
for example Candes and Plan (2009); Lounici (2008); Wainwright (2009b). In general, we can still hope
to bound $\|h_{A_0}\|_\infty$ by $\|h_{A_0}\|_2$. Having a tight bound on $\|h_{A_0}\|_2$ (or $\|h_{A_0}\|_\infty$) and $\|h_{T_0^c}\|_2$ naturally helps
relaxing the requirement on $\beta_{\min,A_0}$ for Lemma 6.2, while in Lemma 6.1, such tight upper bounds will help
us to control both the size of $I$ and $\|\beta_{\mathcal{D}}\|_2$ and therefore achieve a tight bound on the $\ell_2$ loss in the expression
of Lemma 4.3. In general, when the strong signals are close to each other in their strength, a small
$\beta_{\min,A_0}$ implies that we are in a situation with low signal to noise ratio (low SNR); one needs to carefully
trade off false positives with false negatives; this is shown in our experimental results in Section 7. We
refer to Wainwright (2009a) and references therein for discussions on information theoretic limits on sparse
recovery where the particular estimator is not specified.



7 Numerical experiments 

In this section, we present results from numerical simulations designed to validate the theoretical analysis
presented in previous sections. In our Thresholded Lasso implementation (we plan to release the
implementation as an R package), we use a Two-step procedure as described in Section 1: we use the Lasso as
the initial estimator, and OLS in the second step after thresholding. Specifically, we carry out the Lasso
using procedure LARS(Y, X) that implements the LARS algorithm (Efron et al., 2004) to calculate the full
regularization path. We then use $\lambda_n$, whose expression is fixed throughout the experiments as follows,

$$\lambda_n = 0.69\lambda\sigma, \quad\text{where } \lambda = \sqrt{2\log p/n}, \text{ in (1.2)} \qquad (7.1)$$

to select a $\beta_{\mathrm{init}}$ from this output path as our initial estimator. We then threshold the $\beta_{\mathrm{init}}$ using a value $t_0$
typically chosen between $0.5\lambda\sigma$ and $\lambda\sigma$. See each experiment for the actual value used. Given that columns
of $X$ are normalized to have $\ell_2$ norm $\sqrt{n}$, for each input parameter $\beta$, we compute its SNR as follows:

$$\mathrm{SNR} := \|\beta\|_2^2/\sigma^2.$$

To evaluate $\hat\beta$, we use metrics defined in Table 1; we also compute the ratio between squared $\ell_2$ error and
the ideal mean squared error, known as the $\rho^2$; see Section 7.3 for details.
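The tuning quantities used throughout the experiments follow directly from $n$, $p$ and $\sigma$. The sketch below, with hypothetical function names, computes $\lambda$, the penalty $\lambda_n$ of (7.1), the typical range for $t_0$ quoted above, and the SNR.

```python
import math

def penalty_and_threshold_range(n, p, sigma):
    """Return (lam, lam_n, (t0_low, t0_high)) with lam = sqrt(2 log p / n)."""
    lam = math.sqrt(2.0 * math.log(p) / n)
    lam_n = 0.69 * lam * sigma                       # penalty in (7.1)
    t0_range = (0.5 * lam * sigma, lam * sigma)      # typical range for t0
    return lam, lam_n, t0_range

def snr(beta, sigma):
    """Signal-to-noise ratio ||beta||_2^2 / sigma^2."""
    return sum(b * b for b in beta) / sigma ** 2
```

With this choice the penalty $\lambda_n$ always falls inside the typical threshold range, since $0.5 < 0.69 < 1$.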

7.1 Illustrative example 

In the first example, we run the following experiment with a setup similar to what was used in Candes and Tao 
(2007) to conceptually compare the behavior of the Thresholded Lasso with the Gauss-Dantzig selector: 

1. Generate an i.i.d. Gaussian ensemble $X_{n \times p}$, where $X_{ij} \sim N(0, 1)$ are independent, which is then
normalized to have column $\ell_2$-norm $\sqrt{n}$.

2. Select a support set $S$ of size $|S| = s$ uniformly at random, and sample a vector $\beta$ with independent
and identically distributed entries on $S$ as follows: $\beta_i = \mu_i(1 + |g_i|)$, where $\mu_i = \pm 1$ with probability
1/2 and $g_i \sim N(0, 1)$.

3. Compute $Y = X\beta + \epsilon$, where the noise $\epsilon \sim N(0, \sigma^2 I_n)$ is generated with $I_n$ being the $n \times n$ identity
matrix. Then feed $Y$ and $X$ to the Thresholded Lasso with thresholding parameter being $t_0$ to recover
$\beta$ using $\hat\beta$.
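Steps 1 to 3 above can be sketched with NumPy as follows. The helper names (`gaussian_design`, `sample_beta`, `simulate`) and the use of a seeded `numpy.random.Generator` are assumptions made for illustration; the Thresholded Lasso fit itself is not included.

```python
import numpy as np

def gaussian_design(n, p, rng):
    """Step 1: i.i.d. N(0, 1) entries, columns rescaled to l2-norm sqrt(n)."""
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
    return X

def sample_beta(p, s, rng):
    """Step 2: random support of size s with entries mu_i * (1 + |g_i|)."""
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    signs = rng.choice([-1.0, 1.0], size=s)
    beta[support] = signs * (1.0 + np.abs(rng.standard_normal(s)))
    return beta, np.sort(support)

def simulate(n, p, s, sigma, seed=0):
    """Step 3: Y = X beta + eps with eps ~ N(0, sigma^2 I_n)."""
    rng = np.random.default_rng(seed)
    X = gaussian_design(n, p, rng)
    beta, support = sample_beta(p, s, rng)
    Y = X @ beta + sigma * rng.standard_normal(n)
    return X, Y, beta, support
```

For the setting of Figure 1 one would call `simulate(72, 256, 8, np.sqrt(8) / 3)`; every nonzero coefficient then has magnitude at least 1 by construction.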

In Figure 1, we set $p = 256$, $n = 72$, $s = 8$, $\sigma = \sqrt{s}/3$ and $t_0 = \lambda\sigma$. We compare the Thresholded Lasso
estimator $\hat\beta$ with the Lasso, where the full LARS regularization path is searched to find the optimal $\tilde\beta$ that
has the minimum $\ell_2$ error.

7.2 Type I/II errors 

We now evaluate the Thresholded Lasso estimator by comparing Type I/II errors under different values
of $t_0$ and SNR. We consider Gaussian random matrices for the design $X$ with both diagonal and Toeplitz
covariance. We refer to the former as i.i.d. Gaussian ensemble and the latter as Toeplitz ensemble. In the
Toeplitz case, the covariance is given by $T(\gamma)_{i,j} = \gamma^{|i-j|}$ where $0 < \gamma < 1$. We run under two noise levels:
$\sigma = \sqrt{s}/3$ and $\sigma = \sqrt{s}$. For each $\sigma$, we vary the threshold $t_0$ from $0.01\lambda\sigma$ to $1.5\lambda\sigma$. For each $\sigma$ and $t_0$
combination, we run the following experiment: First we generate $X$ as in Step 1 above. After obtaining $X$,
we keep it fixed and then repeat Steps 2 and 3 for 200 times with a new $\beta$ and $\epsilon$ generated each time and we
count the number of Type I and II errors in $\hat\beta$. We compute the average at the end of 200 runs, which will
correspond to one data point on the curves in Figure 2 (a) and (b).
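A Toeplitz ensemble can be generated by colouring i.i.d. Gaussian rows with a Cholesky factor of the covariance. The sketch below is one way to do this, not necessarily how the experiments were run: the function name `toeplitz_design` is hypothetical, and the final column rescaling to norm $\sqrt{n}$ is an assumption carried over from Step 1.

```python
import numpy as np

def toeplitz_design(n, p, gamma, rng):
    """Rows drawn from N(0, T(gamma)) with T(gamma)_{ij} = gamma^{|i-j|}, 0 < gamma < 1."""
    idx = np.arange(p)
    cov = gamma ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz covariance
    chol = np.linalg.cholesky(cov)                        # cov = chol @ chol.T
    X = rng.standard_normal((n, p)) @ chol.T              # each row has covariance cov
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)           # assumed normalization, as in Step 1
    return X, cov
```

Larger values of $\gamma$ make neighbouring columns more correlated, which is the regime the low SNR experiments in panels (c) and (d) probe.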

For both types of designs, similar behaviors are observed. For $\sigma = \sqrt{s}/3$, FNs increase slowly; hence there
is a wide range of values from which $t_0$ can be chosen such that FNs and FPs are both zero. In contrast,
when $\sigma = \sqrt{s}$, FNs increase rather quickly as $t_0$ increases due to the low SNR. It is clear that the low SNR
and high correlation combination makes it the most challenging situation for variable selection, as predicted
by our theoretical analysis and others. See discussions in Section 6. In (c) and (d), we run additional
experiments for the low SNR case for Toeplitz ensembles. The performance is improved by increasing the
sample size or lowering the correlation factor.



Table 1: Metrics for evaluating $\hat\beta$

Metric                                      Definition
Type I errors or False Positives (FPs)      # of incorrectly selected non-zeros in $\hat\beta$
Type II errors or False Negatives (FNs)     # of non-zeros in $\beta$ that are not selected in $\hat\beta$
True Positives (TPs)                        # of correctly selected non-zeros
True Negatives (TNs)                        # of zeros in $\beta$ that are also zero in $\hat\beta$
False Positive Rate (FPR)                   FPR = FP/(FP + TN) = FP/(p - s)
True Positive Rate (TPR)                    TPR = TP/(TP + FN) = TP/s
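The metrics in Table 1 translate directly into code. The following standard-library sketch (the function name `selection_metrics` is hypothetical) counts the four outcomes from the true and estimated coefficient vectors, treating a coefficient as selected when it is non-zero.

```python
def selection_metrics(beta_true, beta_hat):
    """Counts and rates of Table 1 for an estimate beta_hat of beta_true."""
    pairs = list(zip(beta_true, beta_hat))
    fp = sum(1 for b, bh in pairs if b == 0 and bh != 0)   # Type I errors
    fn = sum(1 for b, bh in pairs if b != 0 and bh == 0)   # Type II errors
    tp = sum(1 for b, bh in pairs if b != 0 and bh != 0)
    tn = sum(1 for b, bh in pairs if b == 0 and bh == 0)
    fpr = fp / (fp + tn) if fp + tn else 0.0               # FP / (p - s)
    tpr = tp / (tp + fn) if tp + fn else 0.0               # TP / s
    return {"FP": fp, "FN": fn, "TP": tp, "TN": tn, "FPR": fpr, "TPR": tpr}
```

The guards against empty denominators are a convenience of this sketch for the degenerate cases $s = 0$ or $s = p$.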



7.3 $\ell_2$ loss

We now compare the performance of the Thresholded Lasso with the ordinary Lasso by examining the
metric $\rho^2$ defined as follows:

$$\rho^2 = \frac{\sum_{i=1}^p (\hat\beta_i - \beta_i)^2}{\sum_{i=1}^p \min(\beta_i^2, \sigma^2/n)}.$$
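As a small illustration, $\rho^2$ can be computed as the squared $\ell_2$ error divided by the ideal mean squared error $\sum_i \min(\beta_i^2, \sigma^2/n)$. The function name `rho_squared` is hypothetical, and the sketch assumes $\beta$ has at least one non-zero entry so that the denominator is positive.

```python
def rho_squared(beta_true, beta_hat, sigma, n):
    """Ratio of squared l2 error to the ideal MSE sum_i min(beta_i^2, sigma^2 / n)."""
    sq_err = sum((bh - b) ** 2 for b, bh in zip(beta_true, beta_hat))
    ideal = sum(min(b * b, sigma ** 2 / n) for b in beta_true)
    return sq_err / ideal   # assumes beta_true is not identically zero
```

A value close to 1 means the estimator is doing about as well as an oracle that knows which coordinates are above the noise level.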

We first run the above experiment using i.i.d. Gaussian ensemble under the following thresholds:
$t_0 = \lambda\sigma$ for $\sigma = \sqrt{s}/3$, and $t_0 = 0.36\lambda\sigma$ for $\sigma = \sqrt{s}$. These are chosen based on the desire to have low
errors of both types (as shown in Figure 2 (a)). Naturally, for low SNR cases, small $t_0$ will reduce Type
II errors. In practice, we suggest using cross-validation to choose the exact constants in front of $\lambda\sigma$.
We plot the histograms of $\rho^2$ in Figure 2 (e) and (f). In (e), the mean and median are 1.45 and 1.01
for the Thresholded Lasso, and 46.97 and 41.12 for the Lasso. In (f), the corresponding values are 7.26
and 6.60 for the Thresholded Lasso and 10.50 and 10.01 for the Lasso. With high SNR, the Thresholded
Lasso performs extremely well; with low SNR, the improvement of the Thresholded Lasso over the ordinary
Lasso is less prominent; this is in close correspondence with the Gauss-Dantzig selector's behavior as shown
by Candes and Tao (2007).

Next we run the above experiment under different sparsity values of $s$. We again use i.i.d. Gaussian ensemble
with $p = 2000$, $n = 400$, and $\sigma = \sqrt{s}/3$. The threshold is set at $t_0 = \lambda\sigma$. The SNR for different $s$ is fixed
at around 32.36. Table 2 shows the mean of the $\rho^2$ for the Lasso and the Thresholded Lasso estimators. The
Thresholded Lasso performs consistently better than the ordinary Lasso until about $s = 80$, after which both
break down. For the Lasso, we always choose from the full regularization path the optimal $\tilde\beta$ that has the
minimum $\ell_2$ loss.

Figure 2: $p = 256$, $s = 8$. (a) (b) Type I/II errors for i.i.d. Gaussian and Toeplitz ensembles. Each vertical
bar represents ±1 std. The unit of x-axis is in $\lambda\sigma$. For both types of design matrices, FPs decrease and FNs
increase as the threshold increases. For Toeplitz ensembles, in (c) with fixed correlation $\gamma$, FNs decrease
with more samples, and in (d) with fixed sample size, FNs decrease as the correlation $\gamma$ decreases. (e) (f)
Histograms of $\rho^2$ under i.i.d. Gaussian ensembles from 500 runs.



Table 2: $\rho^2$ under different sparsity and fixed SNR. Average over 100 runs for each $s$.

s                    5       18      20      40      60      80      100
SNR                  34.66   32.99   32.29   32.08   32.28   32.56   32.54
Lasso                17.42   22.01   44.89   52.68   31.88   29.40   47.63
Thresholded Lasso    1.02    0.96    1.11    1.54    10.32   29.38   53.81



7.4 Linear Sparsity 

We next present results demonstrating that the Thresholded Lasso recovers a sparse model using a small
number of samples per non-zero component in $\beta$ when $X$ is a subgaussian ensemble. We run under three
cases of $p = 256, 512, 1024$; for each $p$, we increase the sparsity $s$ by roughly equal steps from
$s = 0.2p/\log(0.2p)$ to $p/4$. For each $p$ and $s$, we run with different sample size $n$. For each tuple $(n, p, s)$, we
run an experiment similar to the one described in Section 7.2 with an i.i.d. Gaussian ensemble $X$ being
fixed while repeating Steps 2 and 3 100 times. In Step 2, each randomly selected non-zero coordinate of
$\beta$ is assigned a value of ±0.9 with probability 1/2. After each run, we compare $\hat\beta$ with the true $\beta$; if all
components match in signs, we count this experiment as a success. At the end of the 100 runs, we compute
the percentage of successful runs as the probability of success. We compare with the ordinary Lasso, for
which we search over the full regularization path of LARS and choose the $\tilde\beta$ that best matches $\beta$ in terms of
support.
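The success criterion described above, namely that all components of the estimate match the true vector in sign, can be sketched as follows. The function names are hypothetical; a zero coefficient has sign 0, so a run with any false positive or false negative counts as a failure.

```python
def signs_match(beta_true, beta_hat):
    """True when every component of beta_hat has the same sign (-1, 0, +1) as beta_true."""
    sign = lambda v: (v > 0) - (v < 0)
    return all(sign(b) == sign(bh) for b, bh in zip(beta_true, beta_hat))

def success_rate(outcomes):
    """Percentage of successful runs among a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

Only the signs are compared, so the magnitude of a correctly signed estimate does not affect the outcome.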

We experiment with $\sigma = 1$ and $\sigma = \sqrt{s}/3$. For $\sigma = 1$, we set $t_0 = f_t\sqrt{|S_0|}\,\lambda\sigma$, where
$S_0 = \{j : \beta_{j,\mathrm{init}} \ge 0.5\lambda_n = 0.35\lambda\sigma\}$ for $\lambda_n$ as in (7.1), and $f_t$ is chosen from the range of [0.12, 0.24]
(cf. Section 3). For $\sigma = \sqrt{s}/3$, we set $t_0 = 0.7\lambda\sigma$ with SNR being fixed. The results are shown in Figure 3.
We observe that under both noise levels, the Thresholded Lasso estimator requires far fewer samples than
the ordinary Lasso in order to conduct exact recovery of the sparsity pattern of the true linear model when
all non-zero components are sufficiently large. When $\sigma$ is fixed as $s$ increases, the SNR is increasing; the
experimental results illustrate the behavior of sparse recovery when it is close to the noiseless setting.
Given the same sparsity, more samples are required for the low SNR case to reach the same level of success
rate. Similar behavior was also observed for Toeplitz and Bernoulli ensembles with i.i.d. ±1 entries.

Figure 3: (a) (b) Compare the probability of success for $p = 256$ and $p = 512$ under two noise levels. The
Thresholded Lasso estimator requires far fewer samples than the ordinary Lasso. (c) (d) (e) show the
probability of success of the Thresholded Lasso under different levels of sparsity and noise levels when $n$
increases for $p = 512$ and 1024. (f) The number of samples $n$ increases almost linearly with $s$ for $p = 1024$.
More samples are required to achieve the same level of success when $\sigma = \sqrt{s}/3$ due to the relatively low
SNR.

7.5 ROC comparison

We now compare the performance of the Thresholded Lasso estimator with the Lasso and the Adaptive
Lasso by examining their ROC curves. Our parameters are $p = 512$, $n = 330$, $s = 64$ and we run under
two cases: $\sigma = \sqrt{s}/3$ and $\sigma = \sqrt{s}$. In the Thresholded Lasso, we vary the threshold level from $0.01\lambda\sigma$ to
$1.5\lambda\sigma$. For each threshold, we run the experiment described in Section 7.2 with an i.i.d. Gaussian ensemble
$X$ being fixed while repeating Steps 2 and 3 100 times. After each run, we compute the FPR and TPR of the $\hat\beta$,
and compute their averages after 100 runs as the FPR and TPR for this threshold. For the Lasso, we compute
the FPR and TPR for each output vector along its entire regularization path. For the Adaptive Lasso, we use
the optimal output $\tilde\beta$ in terms of $\ell_2$ loss from the initial Lasso penalization path as the input to its second
step, that is, we set $\beta_{\mathrm{init}} := \tilde\beta$ and use $w_j = 1/\beta_{\mathrm{init},j}$ to compute the weights for penalizing those non-zero
components in $\beta_{\mathrm{init}}$ in the second step, while all zero components of $\beta_{\mathrm{init}}$ are now removed. We then compute
the FPR and TPR for each vector that we obtain from the second step's LARS output. We implement the
algorithms as given in Zou (2006), the details of which are omitted here as its implementation has become
standard. The ROC curves are plotted in Figure 4. The Thresholded Lasso performs better than both the
ordinary Lasso and the Adaptive Lasso; its advantage is more apparent when the SNR is high.

Figure 4: $p = 512$, $n = 330$, $s = 64$. ROC for the Thresholded Lasso, ordinary Lasso and Adaptive Lasso.
The Thresholded Lasso clearly outperforms the ordinary Lasso and the Adaptive Lasso for both high and
low SNRs.



8 Conclusion 

In this paper, we show that the thresholding method is effective in variable selection and accurate in statistical 
estimation. It improves the ordinary Lasso in significant ways. For example, we allow a very significant 
number of non-zero elements in the true parameter, for which the ordinary Lasso would have failed. On the 
theoretical side, we show that if X obeys the RE condition and if the true parameter is sufficiently sparse, 
the Thresholded Lasso achieves the $\ell_2$ loss within a logarithmic factor of the ideal mean square error one 
would achieve with an oracle, while selecting a sufficiently sparse model $I$. This is accomplished when the 
threshold level is at about $\sqrt{2\log p/n}\,\sigma$, assuming that columns of X have $\ell_2$ norm $\sqrt{n}$. We also report a 
similar result on the Gauss-Dantzig selector under the UUP, built upon results from Candes and Tao (2007). 
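
To make the procedure concrete, here is a minimal sketch of a one-pass variant: compute an initial Lasso estimate, keep the coordinates whose magnitude reaches a threshold, and refit by ordinary least squares on the selected set. It is an illustration under stated assumptions rather than the implementation used in the paper: the solver is a plain proximal-gradient (ISTA) loop chosen only to keep the example self-contained, the names `soft`, `lasso_ista` and `thresholded_lasso` are hypothetical, and the multi-step refinements analysed in the paper are omitted.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimize (1/2n)||Y - Xb||_2^2 + lam * ||b||_1 by proximal gradient steps."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - Y) / n
        b = soft(b - grad / L, lam / L)
    return b

def thresholded_lasso(X, Y, lam, t0):
    """Initial Lasso fit, hard threshold at t0, then OLS refit on the kept set."""
    beta_init = lasso_ista(X, Y, lam)
    I = np.flatnonzero(np.abs(beta_init) >= t0)
    beta_hat = np.zeros(X.shape[1])
    if I.size:
        beta_hat[I] = np.linalg.lstsq(X[:, I], Y, rcond=None)[0]
    return beta_hat, I

# Noiseless toy check: columns of X have l2 norm sqrt(n) with n = p = 4.
X = 2.0 * np.eye(4)
beta = np.array([3.0, 0.0, 0.0, 0.0])
beta_hat, I = thresholded_lasso(X, X @ beta, lam=0.5, t0=1.0)
```

In this toy case the Lasso estimate of the first coordinate is shrunk to 2.5, and the least-squares refit on the selected coordinate removes that shrinkage bias.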






When the SNR is high, almost exact recovery of the non-zeros in $\beta$ is possible as shown in our theory; 
exact recovery of the support of $\beta$ is shown in our simulation study when n is only linear in s for several 
Gaussian and Bernoulli random ensembles. When the SNR is relatively low, the inference task is difficult 
for any estimator. In this case, we show that the Thresholded Lasso trades off Type I and II errors nicely: we 
recommend choosing the thresholding parameter conservatively. Algorithmic issues such as how to obtain an 
estimate of $\sigma$ and of parameters related to the incoherence conditions are left as future work. While the current 
focus is on $\ell_2$ loss, we are also interested in exploring the sparsity oracle inequalities for the Thresholded 
Lasso under the RE condition as studied in Bickel et al. (2009) in our future work. 



A Proof of Theorem 1.1 

Proving Theorem 1.1 involves showing that the Lasso and the Dantzig selector satisfy (3.2). These have 
been proved in Bickel et al. (2009). Theorem 1.1 is then an immediate corollary of Theorem 3.1 under 
assumptions therein. We note that on $\mathcal{T}_a$, it holds that $\|\upsilon_{\mathrm{init},S^c}\|_1 \le k_0 \|\upsilon_{\mathrm{init},S}\|_1$, where $k_0 = 1$ for the 
Dantzig selector when $\lambda_n \ge \lambda_{\sigma,a,p}$ and $k_0 = 3$ for the Lasso when $\lambda_n \ge 2\lambda_{\sigma,a,p}$. Then on 
$\mathcal{T}_a$ as in (1.15), (3.2) holds with $B_0 = 4K^2(s,3)$ and $B_1 = 3K^2(s,3)$ for the Lasso under $RE(s,3,X)$, and 
(3.2) holds with $B_0 = B_1 = 4K^2(s,1)$ for the Dantzig selector under $RE(s,1,X)$; see Zhou (2009a) for 
deriving the exact constants here. □ 



B Proof of Theorem 1.4 



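Proof of Theorem 1.4. It is clear by construction that under $\mathcal{T}_a$, $X\hat\beta_I = P_I Y$ and $|I| \le 2s_0$. Hence 

$$
\begin{aligned}
\|X\hat\beta_I - X\beta\|_2/\sqrt{n} &= \|(P_I - \mathrm{Id})X\beta + P_I\epsilon\|_2/\sqrt{n} \\
&\le \|X_{I^c}\beta_{I^c}\|_2/\sqrt{n} + \|P_I\epsilon\|_2/\sqrt{n} \\
&\le \sqrt{\Lambda_{\max}(s)}\,\|\beta_{\mathcal{D}}\|_2 + \frac{\sqrt{|I|(1+a)\Lambda_{\max}(|I|)}\,\lambda\sigma}{\Lambda_{\min}(|I|)},
\end{aligned}
$$

where we have on $\mathcal{T}_a$, for $\lambda_{\sigma,a,p} = \sqrt{1+a}\,\lambda\sigma$, where $\lambda = \sqrt{2\log p/n}$, 

$$
\begin{aligned}
\|X_I(X_I^TX_I)^{-1}X_I^T\epsilon\|_2/\sqrt{n} &\le \|X_I(X_I^TX_I/n)^{-1}/\sqrt{n}\|_2\,\|X_I^T\epsilon/n\|_2 \\
&\le \frac{\sqrt{\Lambda_{\max}(|I|)}\sqrt{|I|}\,\lambda_{\sigma,a,p}}{\Lambda_{\min}(|I|)} = \frac{\sqrt{|I|(1+a)\Lambda_{\max}(|I|)}\,\lambda\sigma}{\Lambda_{\min}(|I|)}.
\end{aligned}
$$

Now by Lemma 4.2 and 5.2, we have $\|\beta_{\mathcal{D}}\|_2 \le C\sqrt{s_0}\,\lambda\sigma$ for some constant C. 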



□ 

C Proof of Proposition 1.5 

Recall that $|\beta_j| \le \lambda\sigma$ for all $j > a_0$ as defined in (6.1); hence for $\lambda = \sqrt{2\log p/n}$, we have by (G.1), 

$$\sum_{j>a_0}\min(\beta_j^2, \lambda^2\sigma^2) = \sum_{j>a_0}\beta_j^2 \le (s_0 - a_0)\lambda^2\sigma^2;$$

hence 

$$\left|\left\{j \in A_0^c : |\beta_j| \ge \sqrt{\log p/(c'n)}\,\sigma\right\}\right| \le 2c'(s_0 - a_0), \quad \text{where } |T_0 \setminus A_0| = s_0 - a_0.$$

Now given that $\beta_i \ge \beta_j$ for all $i \in T_0$, $j \in T_0^c$, the proposition holds. 



□ 



D Proof of Theorem 3.1 



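We first state two lemmas. Define $\upsilon_{\mathrm{init}} = \beta_{\mathrm{init}} - \beta$ and $\upsilon^{(i)} = \hat\beta^{(i)} - \beta$. 
Lemma D.1. Under assumptions in Theorem 3.1, suppose on $\mathcal{T}_a \cap \mathcal{Q}_b$, 

$$\beta_{\min} \ge H + \Gamma, \quad \text{where } H := \max_{i=0,1}\|\upsilon^{(i)}\|_\infty \text{ and } \Gamma := \max_{i=0,1} t_i. \qquad \text{(D.1)}$$

Then $S \subset S_2 \subset S_1$. 

Proof. We have $\forall j \in S$, 

$$\beta_{\mathrm{init},j} \ge \beta_{\min} - \|\upsilon_{\mathrm{init}}\|_\infty \ge \beta_{\min} - H \ge \Gamma \ge t_0 \quad \text{and} \quad \hat\beta_j^{(1)} \ge \beta_{\min} - \|\upsilon^{(1)}\|_\infty \ge \beta_{\min} - H \ge \Gamma \ge t_1.$$

Thus the lemma holds by definition of $S_i$, for i = 0, 1, 2. □ 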



(D.2) 



The following lemma follows from Lemma 4.3, by plugging in $\|\beta_{\mathcal{D}}\|_2 = 0$. 

Lemma D.2. ($\ell_2$-loss for the OLS estimators) Suppose that $I \supseteq S$ and $|I| \le 2s$; then the OLS estimator 
$\hat\beta_I := (X_I^TX_I)^{-1}X_I^TY$ satisfies on $\mathcal{T}_a$, $\|\hat\beta_I - \beta\|_2 \le \lambda_{\sigma,a,p}\sqrt{|I|}/\Lambda_{\min}(|I|)$, which satisfies (3.4) with 
$B_2 = 1/(B\Lambda_{\min}(2s))$. 
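
The deterministic step behind this bound can be checked numerically. When $I \supseteq S$, the OLS error equals $(X_I^TX_I/n)^{-1}X_I^T\epsilon/n$, so its $\ell_2$ norm is at most $\|X_I^T\epsilon/n\|_2$ divided by the smallest eigenvalue of $X_I^TX_I/n$. The sketch below verifies that inequality on simulated data; the dimensions, seed and choice of $I$ are arbitrary assumptions, and the eigenvalue used is that of the particular Gram matrix for this $I$, which is at least the sparse eigenvalue $\Lambda_{\min}(|I|)$ appearing in the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, sigma = 200, 50, 5, 1.0

# Gaussian design with columns rescaled to l2 norm sqrt(n), as in the paper.
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

beta = np.zeros(p)
beta[:s] = 1.0                      # true support S = {0, ..., s-1}
eps = sigma * rng.standard_normal(n)
Y = X @ beta + eps

I = np.arange(2 * s)                # an index set containing S with |I| = 2s
XI = X[:, I]
beta_I = np.linalg.lstsq(XI, Y, rcond=None)[0]

err = np.linalg.norm(beta_I - beta[I])               # ||beta_hat_I - beta||_2
lam_min = np.linalg.eigvalsh(XI.T @ XI / n)[0]       # smallest Gram eigenvalue
bound = np.linalg.norm(XI.T @ eps / n) / lam_min     # deterministic upper bound
```

The lemma then replaces $\|X_I^T\epsilon/n\|_2$ by $\sqrt{|I|}\,\lambda_{\sigma,a,p}$, which is where the event $\mathcal{T}_a$ enters.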

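Proof of Theorem 3.1. It is clear by construction that 

$$S_2 \subset S_1 \subset S_0.$$

Recall that $S_0$ is obtained by thresholding $\beta_{\mathrm{init}}$ with $4\lambda_n$, hence by (3.2), we have 

$$|S_0 \setminus S| \le \frac{\|\upsilon_{\mathrm{init},S^c}\|_1}{4\lambda_n} \le \frac{B_1\lambda_n s}{4\lambda_n} = \frac{B_1 s}{4}.$$

1. If $B_1 \le 4$, we have that $|S_0| \le 2s$; 

2. Otherwise, we have $|S_0| \le s + B_1 s/4 \le B_1 s/2$. 

Hence for $t_i = 4\lambda_n\sqrt{|S_i|}$, $\forall i = 0, 1$ and $\Gamma$ as in (D.1), it holds by (D.2) that 

$$\Gamma = t_0 = 4\lambda_n\sqrt{|S_0|} \le \lambda_n\sqrt{s}\,\max\{2\sqrt{2B_1},\, 4\sqrt{2}\}. \qquad \text{(D.3)}$$

Now given (3.3) and (3.2), we have $\forall j \in S$, 

$$\beta_{\mathrm{init},j} \ge \beta_{\min} - \|\upsilon_{\mathrm{init}}\|_\infty \ge \beta_{\min} - \|\upsilon_{\mathrm{init},S}\|_2 \ge \Gamma = t_0,$$

and hence it holds that $S \subset S_1 \subset S_0$ by construction of $S_1$, and hence $t_0 \ge 4\lambda_n\sqrt{s}$. Now by (3.2), we have 
for $s \ge B_1^2/16$, 

$$|S_1 \setminus S| \le \frac{\|\upsilon_{\mathrm{init},S^c}\|_1}{t_0} \le \frac{B_1\lambda_n s}{4\lambda_n\sqrt{s}} = \frac{B_1\sqrt{s}}{4} \le s; \quad \text{and } |S_1| \le 2s.$$

For the OLS estimator $\hat\beta^{(1)}$ with $I = S_1$, by Lemma D.2, we have on $\mathcal{T}_a$ 

$$\|\upsilon^{(1)}\|_2 = \|\hat\beta^{(1)} - \beta\|_2 \le \frac{\lambda_{\sigma,a,p}\sqrt{s_1}}{\Lambda_{\min}(s_1)} \le \frac{\lambda_n\sqrt{2s}}{B\Lambda_{\min}(2s)} = B_2\lambda_n\sqrt{2s}, \quad \text{where } s_1 := |S_1|, \qquad \text{(D.4)}$$

where $\lambda_n \ge B\lambda_{\sigma,a,p}$, for $\lambda_{\sigma,a,p}$ as in (1.15), and $B_2 = 1/(B\Lambda_{\min}(2s))$. Clearly we have by definition of $H$ 
in (D.1), 

$$H \le \max_{i=0,1}\|\upsilon^{(i)}\|_2 \le \max\{B_0, \sqrt{2}B_2\}\,\lambda_n\sqrt{s},$$

and thus $\beta_{\min} \ge H + \Gamma$ holds given (3.3) and (D.3). By Lemma D.1, we have $S_i \supseteq S$, $\forall i = 0, 1, 2$. It 
remains to show (3.5) and (3.4). Upon thresholding $\hat\beta^{(1)}$ with $t_1$, we have for $s_1 := |S_1|$ and $\lambda_n \ge B\lambda_{\sigma,a,p}$, 

$$|S_2 \setminus S| \le \frac{\|\upsilon^{(1)}\|_2^2}{t_1^2} \le \left(\frac{\lambda_{\sigma,a,p}\sqrt{s_1}}{\Lambda_{\min}(s_1)}\right)^2\frac{1}{(4\lambda_n\sqrt{s_1})^2} \le \frac{1}{16B^2\Lambda_{\min}^2(s_1)}.$$

Now for the final estimator in (3.1), we have on $\mathcal{T}_a \cap \mathcal{Q}_b$ by Lemma D.2, 

$$\|\hat\beta_I - \beta\|_2 \le \lambda_{\sigma,a,p}\sqrt{|S_2|}/\Lambda_{\min}(|S_2|) \le \lambda_n B_2\sqrt{2s}.$$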



□ 



E Proofs for the Gauss-Dantzig selector 

Recall $\beta_{\mathrm{init}}$ is the solution to the Dantzig selector. We write $\beta = \beta^{(1)} + \beta^{(2)}$ where 

$$\beta_j^{(1)} = \beta_j \cdot 1_{1 \le j \le s_0} \quad \text{and} \quad \beta_j^{(2)} = \beta_j \cdot 1_{j > s_0}.$$

Let $h = \beta_{\mathrm{init}} - \beta^{(1)}$, where $\beta^{(1)}$ is the hard-thresholded version of $\beta$, localized to $T_0 = \{1, \ldots, s_0\}$. Let $T_1$ be the 
$s_0$ largest positions of $h$ outside of $T_0$; let $T_{01} = T_0 \cup T_1$. The proof of Proposition 4.1 (cf. Candes and Tao 
(2007)) yields the following: 

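$$\|h_{T_{01}}\|_2 \le C_0'\lambda_{p,\tau}\sigma\sqrt{s_0}, \quad \text{for } C_0' \text{ as in (4.1)}, \qquad \text{(E.1)}$$
$$\|h_{T_0^c}\|_1 \le C_1\lambda_{p,\tau}\sigma s_0, \quad \text{where } C_1 = C_0' + \frac{1+\delta}{1-\delta-\theta}, \text{ and} \qquad \text{(E.2)}$$
$$\|h_{T_{01}^c}\|_2 \le \|h_{T_0^c}\|_1/\sqrt{s_0} \le C_1\lambda_{p,\tau}\sigma\sqrt{s_0} \quad \text{(cf. Lemma F.2)}. \qquad \text{(E.3)}$$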
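
Proof of Lemma 4.2. Consider the set $I \cap T_0^c := \{j \in T_0^c : |\beta_{j,\mathrm{init}}| > t_0\}$. It is clear by definition of 
$h = \beta_{\mathrm{init}} - \beta^{(1)}$ and (E.2) that 

$$|I \cap T_0^c| \le \|\beta_{T_0^c,\mathrm{init}}\|_1/t_0 = \|h_{T_0^c}\|_1/t_0 \le s_0, \qquad \text{(E.4)}$$

where $t_0 \ge C_1\lambda_{p,\tau}\sigma$. Thus $|I| = |I \cap T_0| + |I \cap T_0^c| \le 2s_0$. Now (1.16) holds given (E.4) and 
$|I \cup S| = |S| + |I \cap S^c| \le s + |I \cap T_0^c| \le s + s_0$. We now bound $\|\beta_{\mathcal{D}}\|_2^2$. By (E.1) and (6.2), where $\mathcal{D}_{11} \subset T_0$, we 
have for $t_0 \le C_4\lambda_{p,\tau}\sigma$, 

$$\|\beta_{\mathcal{D}}\|_2^2 \le (s_0 - a_0)\lambda^2\sigma^2 + \left(t_0\sqrt{a_0} + \|h_{T_0}\|_2\right)^2 \le \left((C_4 + C_0')^2 + 1\right)\lambda_{p,\tau}^2\sigma^2 s_0. \quad □$$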
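Proof of Lemma 4.3. Note that $X_{I^c}\beta_{I^c} = X_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}$. We have 

$$
\begin{aligned}
\hat\beta_I = (X_I^TX_I)^{-1}X_I^TY &= (X_I^TX_I)^{-1}X_I^T(X_I\beta_I + X_{I^c}\beta_{I^c} + \epsilon) \\
&= \beta_I + (X_I^TX_I)^{-1}X_I^TX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}} + (X_I^TX_I)^{-1}X_I^T\epsilon.
\end{aligned}
$$

Hence 

$$\|\hat\beta_I - \beta_I\|_2 = \left\|(X_I^TX_I)^{-1}X_I^TX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}} + (X_I^TX_I)^{-1}X_I^T\epsilon\right\|_2, \qquad \text{(E.5)}$$

where the second term is bounded as in Lemma D.2: we have on $\mathcal{T}_a$, 

$$\left\|(X_I^TX_I)^{-1}X_I^T\epsilon\right\|_2 \le \left\|\left(\frac{X_I^TX_I}{n}\right)^{-1}\right\|_2\left\|\frac{X_I^T\epsilon}{n}\right\|_2 \le \frac{\sqrt{|I|}}{\Lambda_{\min}(|I|)}\lambda_{\sigma,a,p} \qquad \text{(E.6)}$$

by (1.8), where $\lambda_{\sigma,a,p} = \sqrt{1+a}\,\lambda\sigma$ for $\lambda = \sqrt{2\log p/n}$. We now focus on bounding the first term in (E.5). 
Let $P_I$ denote the orthogonal projection onto the span of the columns of $X_I$. Let 

$$c = (X_I^TX_I)^{-1}X_I^TX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}, \quad \text{hence } X_Ic = P_IX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}.$$

By the disjointness of $I$ and $S_{\mathcal{D}}$, we have for $P_IX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}} := X_Ic$, 

$$\left\|P_IX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}\right\|_2^2 = \langle X_Ic, X_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}\rangle \le n\,\theta_{|I|,|S_{\mathcal{D}}|}\,\|c\|_2\,\|\beta_{S_{\mathcal{D}}}\|_2, \qquad \text{(E.7)}$$

where 

$$\|c\|_2 \le \frac{\|X_Ic\|_2}{\sqrt{n\Lambda_{\min}(|I|)}} = \frac{\left\|P_IX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}\right\|_2}{\sqrt{n\Lambda_{\min}(|I|)}}, \qquad \text{(E.8)}$$

and $\|c\|_2 \le \theta_{|I|,|S_{\mathcal{D}}|}\|\beta_{\mathcal{D}}\|_2/\Lambda_{\min}(|I|)$. Now we have on $\mathcal{T}_a$, by (E.6), 

$$
\begin{aligned}
\|\hat\beta_I - \beta_I\|_2 &\le \left\|(X_I^TX_I)^{-1}X_I^TX_{S_{\mathcal{D}}}\beta_{S_{\mathcal{D}}}\right\|_2 + \left\|(X_I^TX_I)^{-1}X_I^T\epsilon\right\|_2 \\
&\le \frac{\theta_{|I|,|S_{\mathcal{D}}|}}{\Lambda_{\min}(|I|)}\|\beta_{\mathcal{D}}\|_2 + \frac{\sqrt{|I|}}{\Lambda_{\min}(|I|)}\lambda_{\sigma,a,p}.
\end{aligned}
$$

Now the lemma holds given 

$$\|\hat\beta_I - \beta\|_2^2 = \|\hat\beta_I - \beta_I\|_2^2 + \|\beta_{\mathcal{D}}\|_2^2.$$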



□ 

Proof of Theorem 1.2. It holds by definition of $S_{\mathcal{D}}$ that $I \cap S_{\mathcal{D}} = \emptyset$. It is clear by Lemma 4.2 that 
$|S_{\mathcal{D}}| \le s$ and $|I| \le 2s_0$ and $|I \cup S_{\mathcal{D}}| \le |I \cup S| \le s + s_0 \le 2s$. Thus for $\hat\beta_I = (X_I^TX_I)^{-1}X_I^TY$, we have 
for $\lambda = \sqrt{2\log p/n}$, and by (4.5) 

2 <r im i|2 A , 2g m.|gpl ^ 2 I J I x2 

< AV. ((^ + ^ + C i? + 1) (l + jggj) + ^) . 
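Thus the theorem holds for $C_3$ as in (4.3) by (4.5), where it holds for $\tau > 0$ that 

$$\frac{\theta_{s,2s_0}}{\Lambda_{\min}(2s_0)} \le \frac{\theta_{s,2s}}{\Lambda_{\min}(2s_0)} \le \frac{1 - \delta_{2s} - \tau}{\Lambda_{\min}(2s)} < 1,$$

given that $\theta_{s,2s} < 1 - \tau - \delta_{2s} \le \Lambda_{\min}(2s)$ for $\tau > 0$. □ 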



F Oracle properties of the Lasso 

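We first show Lemma F.1, which gives us the prediction error using $\beta_{T_0}$. 
Lemma F.1. Suppose that (1.5) holds. We have for $\lambda = \sqrt{(2\log p)/n}$, 

$$\|X\beta - X\beta_{T_0}\|_2/\sqrt{n} \le \sqrt{\Lambda_{\max}(s - s_0)}\,\lambda\sigma\sqrt{s_0}. \qquad \text{(F.1)}$$

Proof. The lemma holds given that $\|\beta_{T_0^c}\|_2 \le \lambda\sigma\sqrt{s_0}$, and $\|X\beta - X\beta_{T_0}\|_2/\sqrt{n} = \|X\beta_{T_0^c}\|_2/\sqrt{n} \le \sqrt{\Lambda_{\max}(s - s_0)}\,\|\beta_{T_0^c}\|_2$. □ 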

We then state Lemma F.2, followed by the proof of Theorem 5.1, where we do not focus on obtaining the 
best constants. Lemma F.2 is the same (up to normalization) as Lemma 3.1 in Candes and Tao (2007). 
We note that in their original statement, the UUP condition is assumed; a careful examination of their proof 
shows that it is a sufficient but not necessary condition; indeed we only need to assume that $\Lambda_{\min}(2s_0) > 0$ 
and $\theta_{s_0,2s_0} < \infty$, as we show below. The proof is included at the end of this section for the purpose of a 
self-contained presentation. 

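Lemma F.2. Suppose $\Lambda_{\min}(2s_0) > 0$ and $\theta_{s_0,2s_0} < \infty$. Then 

$$\|h_{T_{01}}\|_2 \le \frac{1}{\sqrt{\Lambda_{\min}(2s_0)}\sqrt{n}}\|Xh\|_2 + \frac{\theta_{s_0,2s_0}}{\Lambda_{\min}(2s_0)\sqrt{s_0}}\|h_{T_0^c}\|_1,$$

$$\|h_{T_{01}^c}\|_2^2 \le \|h_{T_0^c}\|_1^2\sum_{k \ge s_0+1}1/k^2 \le \|h_{T_0^c}\|_1^2/s_0, \quad \text{and thus}$$

$$\|h\|_2^2 \le \|h_{T_{01}}\|_2^2 + s_0^{-1}\|h_{T_0^c}\|_1^2.$$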
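
Proof of Theorem 5.1. Throughout this proof, we assume that $\mathcal{T}_a$ holds. We use $\hat\beta := \beta_{\mathrm{init}}$ to represent 
the solution to the Lasso estimator in (1.2). By the optimality of $\hat\beta$, we have for $\beta_0 = \beta_{T_0}$, 

$$\frac{1}{2n}\|Y - X\hat\beta\|_2^2 - \frac{1}{2n}\|Y - X\beta_0\|_2^2 \le \lambda_n\|\beta_0\|_1 - \lambda_n\|\hat\beta\|_1, \qquad \text{(F.2)}$$

where 

$$\|Y - X\hat\beta\|_2^2 = \|X\beta - X\hat\beta + \epsilon\|_2^2 = \|X\hat\beta - X\beta\|_2^2 + 2(\beta - \hat\beta)^TX^T\epsilon + \|\epsilon\|_2^2,$$

and similarly, we have for $\beta_0 = \beta_{T_0}$, 

$$\|Y - X\beta_0\|_2^2 = \|X\beta - X\beta_0 + \epsilon\|_2^2 = \|X\beta - X\beta_0\|_2^2 + 2(\beta - \beta_0)^TX^T\epsilon + \|\epsilon\|_2^2.$$

Let $h = \hat\beta - \beta_0$. Thus by (F.2) and the triangle inequality, we have on $\mathcal{T}_a$ 

$$
\begin{aligned}
\frac{\|X\hat\beta - X\beta\|_2^2}{n} &\le \frac{\|X\beta - X\beta_0\|_2^2}{n} + \frac{2h^TX^T\epsilon}{n} + 2\lambda_n\left(\|\beta_0\|_1 - \|\hat\beta\|_1\right) \\
&\le \frac{\|X\beta - X\beta_0\|_2^2}{n} + 2\|h\|_1\left\|\frac{X^T\epsilon}{n}\right\|_\infty + 2\lambda_n\left(\|h_{T_0}\|_1 - \|h_{T_0^c}\|_1\right) \\
&\le \frac{\|X\beta - X\beta_0\|_2^2}{n} + 3\lambda_n\|h_{T_0}\|_1 - \lambda_n\|h_{T_0^c}\|_1,
\end{aligned}
$$

where we have used the fact that $\lambda_n \ge 2\lambda_{\sigma,a,p}$ for $a > 0$. Thus we have on $\mathcal{T}_a$, 

$$\|X\hat\beta - X\beta\|_2^2/n + \lambda_n\|h_{T_0^c}\|_1 \le \|X\beta - X\beta_0\|_2^2/n + 3\lambda_n\|h_{T_0}\|_1, \qquad \text{(F.3)}$$

which is also the starting point of our analysis on the oracle inequalities of the Lasso estimator. Now we 
differentiate between two cases. 

1. Suppose that on $\mathcal{T}_a$, $\|X\beta - X\beta_0\|_2^2/n \ge 3\lambda_n\|h_{T_0}\|_1$. We then have that 

$$\|X\hat\beta - X\beta\|_2^2/n + \lambda_n\|h_{T_0^c}\|_1 \le 2\|X\beta - X\beta_0\|_2^2/n,$$

and hence for $\lambda_n = d_0\lambda\sigma$, where $d_0 \ge 2$, we have by Lemma F.1, 

$$\|h_{T_0^c}\|_1 \le 2\Lambda_{\max}(s - s_0)\lambda\sigma s_0/d_0 \le \Lambda_{\max}(s - s_0)\lambda\sigma s_0.$$

Now by (F.3), we have 

$$
\begin{aligned}
\|h\|_1 &\le \|X\beta - X\beta_0\|_2^2/(n\lambda_n) + 4\|h_{T_0}\|_1 \\
&\le 7\|X\beta - X\beta_0\|_2^2/(3n\lambda_n) \le 7\Lambda_{\max}(s - s_0)\lambda\sigma s_0/(3d_0),
\end{aligned}
$$

and clearly 

$$\|X\hat\beta - X\beta\|_2^2/n \le 2\|X\beta - X\beta_0\|_2^2/n, \qquad \text{(F.4)}$$

$$\|Xh\|_2 \le \|X\hat\beta - X\beta\|_2 + \|X\beta - X\beta_0\|_2 \le (\sqrt{2} + 1)\|X\beta - X\beta_0\|_2.$$

By Lemma F.2, we have on $\mathcal{T}_a$, 

$$
\begin{aligned}
\|h_{T_{01}}\|_2 &\le \frac{1}{\sqrt{\Lambda_{\min}(2s_0)}\sqrt{n}}\|Xh\|_2 + \frac{\theta_{s_0,2s_0}}{\Lambda_{\min}(2s_0)\sqrt{s_0}}\|h_{T_0^c}\|_1 \\
&\le \lambda\sigma\sqrt{s_0}\left(\frac{(\sqrt{2} + 1)\sqrt{\Lambda_{\max}(s - s_0)}}{\sqrt{\Lambda_{\min}(2s_0)}} + \frac{\theta_{s_0,2s_0}\Lambda_{\max}(s - s_0)}{\Lambda_{\min}(2s_0)}\right) = D_0\lambda\sigma\sqrt{s_0},
\end{aligned}
$$

for $D_0 = (\sqrt{2} + 1)\dfrac{\sqrt{\Lambda_{\max}(s - s_0)}}{\sqrt{\Lambda_{\min}(2s_0)}} + \dfrac{\theta_{s_0,2s_0}\Lambda_{\max}(s - s_0)}{\Lambda_{\min}(2s_0)}$. 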
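
2. Otherwise, suppose on $\mathcal{T}_a$, we have $\|X\beta - X\beta_0\|_2^2/n < 3\lambda_n\|h_{T_0}\|_1$; thus 

$$\|X\hat\beta - X\beta\|_2^2/(n\lambda_n) + \|h_{T_0^c}\|_1 \le 6\|h_{T_0}\|_1$$

and $\|h_{T_0^c}\|_1 \le 6\|h_{T_0}\|_1$, which under the $RE(s_0, 6, X)$ condition immediately implies that 

$$\|h_{T_0}\|_2 \le K(s_0, 6)\|Xh\|_2/\sqrt{n}. \qquad \text{(F.5)}$$

The rest of the proof is devoted to this second case. 
We use $K := K(s_0, 6)$ as a shorthand below. By (F.3), we have on $\mathcal{T}_a$, 

$$
\begin{aligned}
\|X\hat\beta - X\beta\|_2^2/n &+ \lambda_n\|h_{T_0^c}\|_1 - \|X\beta - X\beta_0\|_2^2/n \\
&\le 3\lambda_n\|h_{T_0}\|_1 \le 3\lambda_n\sqrt{s_0}\|h_{T_0}\|_2 \le 3K\lambda_n\sqrt{s_0}\,\|Xh\|_2/\sqrt{n} \quad \text{(by (F.5))} \\
&\le 3K\lambda_n\sqrt{s_0}\,\|X\hat\beta - X\beta\|_2/\sqrt{n} + 3K\lambda_n\sqrt{s_0}\,\|X\beta - X\beta_0\|_2/\sqrt{n} \qquad \text{(F.6)} \\
&\le 3K\lambda_n\sqrt{s_0}\,\|X\beta - X\beta_0\|_2/\sqrt{n} + \|X\hat\beta - X\beta\|_2^2/n + \left(3K\lambda_n\sqrt{s_0}/2\right)^2,
\end{aligned}
$$

from which the following immediately follows: for $\lambda_n = d_0\lambda\sigma \ge 2\lambda_{\sigma,a,p}$, we have 

$$
\begin{aligned}
\|h_{T_0^c}\|_1 &\le \|X\beta - X\beta_0\|_2^2/(n\lambda_n) + 3K\sqrt{s_0}\,\|X\beta - X\beta_0\|_2/\sqrt{n} + (3K/2)^2\lambda_n s_0 \\
&= \left(\|X\beta - X\beta_0\|_2/\sqrt{n\lambda_n} + (3K/2)\sqrt{\lambda_n s_0}\right)^2 \le D_1'\lambda\sigma s_0,
\end{aligned}
$$

where $D_1' = \left(\sqrt{\Lambda_{\max}(s - s_0)/d_0} + 3K(s_0, 6)\sqrt{d_0}/2\right)^2$. Similarly, we can derive a bound on $\|h\|_1$ from (F.3); 
we have on $\mathcal{T}_a$, 

$$
\begin{aligned}
\|X\hat\beta - X\beta\|_2^2/n &+ \lambda_n\|h_{T_0^c}\|_1 + \lambda_n\|h_{T_0}\|_1 - \|X\beta - X\beta_0\|_2^2/n \le 4\lambda_n\|h_{T_0}\|_1 \\
&\le 4\lambda_n\sqrt{s_0}\|h_{T_0}\|_2 \le 4K\lambda_n\sqrt{s_0}\,\|Xh\|_2/\sqrt{n} \quad \text{(by (F.5))} \\
&\le 4K\lambda_n\sqrt{s_0}\,\|X\beta - X\beta_0\|_2/\sqrt{n} + \|X\hat\beta - X\beta\|_2^2/n + \left(2K\lambda_n\sqrt{s_0}\right)^2.
\end{aligned}
$$

Hence it is clear that for $\lambda_n = d_0\lambda\sigma \ge 2\lambda_{\sigma,a,p}$, we have by Lemma F.1, on $\mathcal{T}_a$, 

$$
\begin{aligned}
\|h\|_1 &\le \|X\beta - X\beta_0\|_2^2/(n\lambda_n) + 4K\sqrt{s_0}\,\|X\beta - X\beta_0\|_2/\sqrt{n} + 4K^2\lambda_n s_0 \\
&= \left(\|X\beta - X\beta_0\|_2/\sqrt{n\lambda_n} + 2K\sqrt{\lambda_n s_0}\right)^2 \le D_2\lambda\sigma s_0,
\end{aligned}
$$

where $D_2 = \left(\sqrt{\Lambda_{\max}(s - s_0)/d_0} + 2K(s_0, 6)\sqrt{d_0}\right)^2$. Now we derive a bound for $\|X\hat\beta - X\beta\|_2/\sqrt{n}$; our 
starting point is (F.6), from which by shifting items around and adding $(3K\lambda_n\sqrt{s_0}/2)^2$ to both sides, we obtain 

$$
\begin{aligned}
\|X\hat\beta - X\beta\|_2^2/n &- 3K\lambda_n\sqrt{s_0}\,\|X\hat\beta - X\beta\|_2/\sqrt{n} + \left(3K\lambda_n\sqrt{s_0}/2\right)^2 + \lambda_n\|h_{T_0^c}\|_1 \\
&\le \|X\beta - X\beta_0\|_2^2/n + 3K\lambda_n\sqrt{s_0}\,\|X\beta - X\beta_0\|_2/\sqrt{n} + \left(3K\lambda_n\sqrt{s_0}/2\right)^2.
\end{aligned}
$$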
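
Thus we have for $\lambda_n = d_0\lambda\sigma \ge 2\lambda_{\sigma,a,p}$, 

$$\left(\|X\hat\beta - X\beta\|_2/\sqrt{n} - 3K\lambda_n\sqrt{s_0}/2\right)^2 + \lambda_n\|h_{T_0^c}\|_1 \le \left(\|X\beta - X\beta_0\|_2/\sqrt{n} + 3K\lambda_n\sqrt{s_0}/2\right)^2,$$

and hence 

$$
\begin{aligned}
\|X\hat\beta - X\beta\|_2/\sqrt{n} &\le \|X\beta - X\beta_0\|_2/\sqrt{n} + 3K\lambda_n\sqrt{s_0} \qquad \text{(F.7)} \\
&\le \lambda\sigma\sqrt{s_0}\left(\sqrt{\Lambda_{\max}(s - s_0)} + 3d_0K(s_0, 6)\right)
\end{aligned}
$$

by Lemma F.1. Under the $RE(s_0, 6, X)$ condition, we have by (F.7) 

$$
\begin{aligned}
\|h_{T_0}\|_2 \le K(s_0, 6)\|Xh\|_2/\sqrt{n} &\le \frac{K(s_0, 6)}{\sqrt{n}}\left(\|X\hat\beta - X\beta\|_2 + \|X\beta - X\beta_0\|_2\right) \\
&\le K(s_0, 6)\left(2\|X\beta - X\beta_0\|_2/\sqrt{n} + 3K(s_0, 6)\lambda_n\sqrt{s_0}\right) \\
&\le \lambda\sigma\sqrt{s_0}\,K(s_0, 6)\left(2\sqrt{\Lambda_{\max}(s - s_0)} + 3d_0K(s_0, 6)\right).
\end{aligned}
$$

Let $T_1$ be the $s_0$ largest positions of $h$ outside of $T_0$. Now by a property as derived in Zhou (2009a) (Proposition 
A.1), we also have for $K := K(s_0, 6)$, 

$$\|h_{T_{01}}\|_2 \le \sqrt{2}K(s_0, 6)\|Xh\|_2/\sqrt{n} \le \lambda\sigma\sqrt{2s_0}\,K\left(2\sqrt{\Lambda_{\max}(s - s_0)} + 3d_0K\right) \le D_0\lambda\sigma\sqrt{s_0}.$$

Moreover, we have by Lemma F.2, 

$$
\begin{aligned}
\|\hat\beta - \beta\|_2^2 &\le 2\|\hat\beta - \beta_{T_0}\|_2^2 + 2\|\beta - \beta_{T_0}\|_2^2 \le 2\|h\|_2^2 + 2\lambda^2\sigma^2s_0 \\
&\le 2\left(\|h_{T_{01}}\|_2^2 + \|h_{T_0^c}\|_1^2/s_0\right) + 2\lambda^2\sigma^2s_0 \le 2\lambda^2\sigma^2s_0\left(D_0^2 + D_1^2 + 1\right).
\end{aligned}
$$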
We note that (5.1) holds given (F.4) and (F.7). 



□ 



Remark F.3. We could have bounded $\|h_{T_{01}}\|_2$ for the second case also by Lemma F.2; we take the form 
here for simplicity. 

Proof of Lemma F.2. Decompose $h_{T_{01}^c}$ into $h_{T_2}, \ldots, h_{T_K}$ such that $T_2$ corresponds to locations of 
the $s_0$ largest coefficients of $h_{T_{01}^c}$ in absolute values, and $T_3$ corresponds to locations of the next $s_0$ largest 
coefficients of $h_{T_{01}^c}$ in absolute values, and so on. Let $V$ be the span of columns of $X_j$, where $j \in T_{01}$, and 
$P_V$ be the orthogonal projection onto $V$. Decompose $P_VXh$: 

$$P_VXh = P_VXh_{T_{01}} + \sum_{j \ge 2}P_VXh_{T_j} = Xh_{T_{01}} + \sum_{j \ge 2}P_VXh_{T_j}, \quad \text{where}$$

$$\|P_VXh_{T_j}\|_2 \le \frac{\sqrt{n}\,\theta_{s_0,2s_0}}{\sqrt{\Lambda_{\min}(2s_0)}}\|h_{T_j}\|_2 \quad \text{and} \quad \sum_{j \ge 2}\|h_{T_j}\|_2 \le \|h_{T_0^c}\|_1/\sqrt{s_0};$$

see Candes and Tao (2007) for details. Thus we have 

$$
\begin{aligned}
\|Xh_{T_{01}}\|_2 &= \Big\|P_VXh - \sum_{j \ge 2}P_VXh_{T_j}\Big\|_2 \le \|P_VXh\|_2 + \Big\|\sum_{j \ge 2}P_VXh_{T_j}\Big\|_2 \\
&\le \|Xh\|_2 + \sum_{j \ge 2}\|P_VXh_{T_j}\|_2 \le \|Xh\|_2 + \frac{\sqrt{n}\,\theta_{s_0,2s_0}}{\sqrt{\Lambda_{\min}(2s_0)}\sqrt{s_0}}\|h_{T_0^c}\|_1,
\end{aligned}
$$

where we used the fact that $\|P_V\|_2 \le 1$. Hence the lemma follows given $\|h_{T_{01}}\|_2 \le \dfrac{1}{\sqrt{\Lambda_{\min}(2s_0)}\sqrt{n}}\|Xh_{T_{01}}\|_2$. 

For other bounds, the fact that the $k$th largest value of $h_{T_0^c}$ obeys $|h_{T_0^c}|_{(k)} \le \|h_{T_0^c}\|_1/k$ has been used; 
see Candes and Tao (2007). □ 
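
The elementary fact used for the tail bound is easy to check numerically: for any vector, the $k$th largest magnitude is at most the $\ell_1$ norm divided by $k$, so the squared $\ell_2$ norm of everything beyond the $s_0$ largest entries is at most $\|\cdot\|_1^2/s_0$. The sketch below is only an illustration; the function name `tail_l2_bound` and the test vector are made up, and the input plays the role of $h_{T_0^c}$.

```python
import numpy as np

def tail_l2_bound(h, s0):
    """Check |h|_(k) <= ||h||_1 / k and compare the tail l2 norm with ||h||_1^2 / s0."""
    mags = np.sort(np.abs(h))[::-1]            # magnitudes in decreasing order
    k = np.arange(1, mags.size + 1)
    l1 = mags.sum()
    kth_ok = bool(np.all(mags <= l1 / k + 1e-12))
    tail_sq = float(np.sum(mags[s0:] ** 2))    # entries beyond the s0 largest
    return kth_ok, tail_sq, l1 ** 2 / s0

kth_ok, tail_sq, bound = tail_l2_bound(np.array([4.0, -3.0, 2.0, 1.0, 0.5, 0.0]), s0=2)
```

Here the tail has squared norm 5.25 against a bound of 55.125, which shows how loose the inequality can be on a specific vector while still being all that the argument needs.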



G Proofs for Lemmas in Section 6 



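Let $\lambda = \sqrt{2\log p/n}$. By definition of $s_0$ as in (1.14), we have $\sum_{i=1}^p\min(\beta_i^2, \lambda^2\sigma^2) \le s_0\lambda^2\sigma^2$. We write 
$\beta = \beta^{(11)} + \beta^{(12)} + \beta^{(2)}$ where 

$$\beta_j^{(11)} = \beta_j \cdot 1_{1 \le j \le a_0}, \quad \beta_j^{(12)} = \beta_j \cdot 1_{a_0 < j \le s_0}, \quad \text{and} \quad \beta_j^{(2)} = \beta_j \cdot 1_{j > s_0}.$$

Now it is clear that $\sum_{j \le a_0}\min(\beta_j^2, \lambda^2\sigma^2) = a_0\lambda^2\sigma^2$ and hence 

$$\sum_{j > a_0}\min(\beta_j^2, \lambda^2\sigma^2) = \left\|\beta^{(12)} + \beta^{(2)}\right\|_2^2 \le (s_0 - a_0)\lambda^2\sigma^2. \qquad \text{(G.1)}$$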
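
As a small worked example of these quantities, the sketch below computes $s_0$ as the smallest integer with $\sum_i\min(\beta_i^2, \lambda^2\sigma^2) \le s_0\lambda^2\sigma^2$, and takes $a_0$ to be the number of coefficients strictly above $\lambda\sigma$ in magnitude. The latter is an assumption about definition (6.1), which is not reproduced in this appendix; the function name `sparsity_levels` and the numbers are hypothetical.

```python
import math
import numpy as np

def sparsity_levels(beta, lam, sigma):
    """Return (s0, a0) for a coefficient vector at noise level lam * sigma."""
    thr2 = (lam * sigma) ** 2
    total = np.minimum(beta ** 2, thr2).sum()
    s0 = math.ceil(total / thr2)                        # smallest valid integer
    a0 = int(np.sum(np.abs(beta) > lam * sigma))        # assumed definition of a0
    return s0, a0

beta = np.array([3.0, 2.0, 0.5, 0.1, 0.0])              # sorted by magnitude
s0, a0 = sparsity_levels(beta, lam=1.0, sigma=1.0)
tail = float(np.sum(beta[a0:] ** 2))                    # left-hand side of (G.1)
```

With these numbers the truncated sum is 2.26, so $s_0 = 3$ and $a_0 = 2$, and the tail sum 0.26 is indeed below $(s_0 - a_0)\lambda^2\sigma^2 = 1$. Exact ties at an integer value of the ratio can be affected by floating-point rounding.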



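Proof of Lemma 6.1. It is clear for $\mathcal{D}_{11} = \mathcal{D} \cap A_0$, we have $\mathcal{D}_{11} \subset A_0 \subset T_0 \subset S$. Let 
$(\beta_j)_{j \in A_0 \cap \mathcal{D}}$ consist of coefficients of $\beta$ that are above $\lambda\sigma$ in their absolute values but are dropped as $\beta_{j,\mathrm{init}} < t_0$. 
Now by (G.1), we have 

$$\|\beta_{\mathcal{D}}\|_2^2 \le \|\beta_{\mathcal{D}_{11}}\|_2^2 + \left\|\beta^{(12)} + \beta^{(2)}\right\|_2^2 \le \|\beta_{\mathcal{D}_{11}}\|_2^2 + (s_0 - a_0)\lambda^2\sigma^2,$$

where $|\mathcal{D}_{11}| \le a_0$ and thus we have by the triangle inequality, 

$$\|\beta_{\mathcal{D}_{11}}\|_2 \le \|\beta_{\mathcal{D}_{11},\mathrm{init}}\|_2 + \|h_{\mathcal{D}_{11}}\|_2 \le t_0\sqrt{a_0} + \|h_{\mathcal{D}_{11}}\|_2. \qquad \text{(G.2)}$$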

Thus (6.2) holds. Now we replace the bound of \T>n\ < ao with |Pn| < 



J ll 112 



-*ol : 



in (G.2) to obtain 



/4 n> 



< to JA2llJk 

2 Pmin.An 



to 



+ II^X>iill2 = Il^lill2 

Pmin.Ao ~~ r 



which proves (6.3). 



□ 



Proof of Lemma 6.2. Suppose $\mathcal{T}_a \cap \mathcal{Q}_c$ holds. It is clear by the choice of $t_0$ in (6.5) and by (6.4) that 
$\min_{i \in A_0}\beta_{i,\mathrm{init}} \ge \beta_{\min,A_0} - \|h_{A_0}\|_\infty \ge t_0$ and $\mathcal{D}_{11} = \emptyset$. Thus by (6.5), we can bound $|I \cap T_0^c|$, depending on 
which one is applicable, by $|I \cap T_0^c| \le \|\beta_{T_0^c,\mathrm{init}}\|_1/t_0 \le s_0$ or by $|I \cap T_0^c| \le \|\beta_{T_0^c,\mathrm{init}}\|_2^2/t_0^2 \le s_0$. Moreover, 
the bound on $\|\hat\beta_I - \beta\|_2$ follows immediately from Lemma 4.3, on event $\mathcal{T}_a$, where $\theta^2_{|I|,|S_{\mathcal{D}}|}$ is bounded in 
Lemma 5.4 given $|I| + |S_{\mathcal{D}}| \le s + |I \cap T_0^c| \le s + s_0 \le 2s$. □ 